KD HPE

Knowledge Distillation with Global Filters for Efficient Human Pose Estimation

Bio-AIm Lab, School of Computing Science
University of Glasgow
BMVC 2024

Presentation


Abstract

Efficient and accurate 2D human pose estimation (2D-HPE) remains a critical challenge that must be overcome to enable its use on resource-constrained devices. This paper introduces a novel framework that synergizes knowledge distillation with Global Filter Layers (GFL) to enable efficient and scalable 2D human pose estimation. Our approach leverages a high-capacity heatmap network to train a lightweight student network. The student network employs global spectral filters as an alternative to attention-based token mixers, enabling lower computational complexity and higher throughput. We specifically apply this approach to coordinate-classification and regression-based 2D-HPE methods owing to their higher speed compared to heatmap models. We extensively evaluate our approach on the MPII dataset with both regression and coordinate-classification student networks and different filter weighting strategies. While our model is lightweight, it achieves an approximately 18% increase in throughput and, with 89.40 PCKh@0.5 accuracy, closes the performance gap with large state-of-the-art 2D-HPE models.
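To make the token-mixing idea concrete, below is a minimal PyTorch sketch of a GFNet-style global filter layer with a static (learnable) filter, plus one plausible dynamic-weighting variant. The class names, the gating scheme, and the default grid size are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class GlobalFilterLayer(nn.Module):
    """GFNet-style token mixer: element-wise filtering in the 2D Fourier
    domain instead of self-attention (O(N log N) vs. O(N^2))."""

    def __init__(self, dim, h=14, w=8):
        # w must equal W // 2 + 1 for the real FFT of a width-W token grid.
        super().__init__()
        # Static weighting: one learnable complex-valued filter, stored as
        # a real tensor with a trailing dimension of 2 (real, imaginary).
        self.filter = nn.Parameter(torch.randn(h, w, dim, 2) * 0.02)

    def forward(self, x):
        # x: (B, H, W, C) grid of tokens
        B, H, W, C = x.shape
        xf = torch.fft.rfft2(x.float(), dim=(1, 2), norm="ortho")
        weight = torch.view_as_complex(self.filter)   # (h, w, C)
        xf = xf * weight                              # global spectral filtering
        return torch.fft.irfft2(xf, s=(H, W), dim=(1, 2), norm="ortho")


class DynamicGlobalFilterLayer(GlobalFilterLayer):
    """Dynamic weighting: scale the static filter with an input-conditioned
    per-channel gate. This gating scheme is an illustrative guess, not
    necessarily the paper's exact dynamic strategy."""

    def __init__(self, dim, h=14, w=8):
        super().__init__(dim, h, w)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):
        B, H, W, C = x.shape
        g = self.gate(x.mean(dim=(1, 2)))             # (B, C) gate from pooled tokens
        xf = torch.fft.rfft2(x.float(), dim=(1, 2), norm="ortho")
        weight = torch.view_as_complex(self.filter)   # (h, w, C)
        xf = xf * weight * g[:, None, None, :]        # input-dependent filtering
        return torch.fft.irfft2(xf, s=(H, W), dim=(1, 2), norm="ortho")
```

In the student network, a layer like this would stand in for the multi-head self-attention block inside each transformer stage, which is where the throughput gain comes from.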

The main contributions of our work are as follows:

  1. Introduce a novel distillation framework between a heatmap-based teacher model and a coordinate-classification student network, which significantly reduces the performance gap between large Human Pose Estimation (HPE) models and their lightweight counterparts (a loss sketch follows the evaluation summary below).
  2. Investigate the integration of Global Filter Layers (GFLs) as a substitute for the computationally intensive Self-Attention modules in student networks, aiming to achieve faster inference speeds.
  3. Examine the effects of implementing Dynamic and Static Weighting strategies within Global Filter Layers.
We conduct extensive evaluations of our approach on the MPII dataset, a widely used benchmark for HPE algorithms. Our results demonstrate that GFLs, under different weighting strategies, achieve significantly higher speeds than conventional self-attention modules in transformers. Specifically, our FFT-B-GS model achieves an approximately 18.26% increase in speed (FPS) over attention-based models within the coordinate-classification paradigm, and about a 24.15% improvement on regression tasks.
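As a rough illustration of contribution 1, the sketch below distills a heatmap teacher into a SimCC-style coordinate-classification student by marginalizing the teacher's 2D heatmaps into per-axis 1D distributions and matching the student's logits to them with a KL loss. The function names, the temperature, and the assumption that the student's bin counts match the teacher's heatmap resolution are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def heatmap_to_axis_targets(heatmaps, T=2.0):
    """Marginalize teacher heatmaps (B, K, H, W) into per-axis 1D
    distributions over discretized x and y coordinates."""
    tx = F.softmax(heatmaps.sum(dim=2) / T, dim=-1)  # (B, K, W): x marginal
    ty = F.softmax(heatmaps.sum(dim=3) / T, dim=-1)  # (B, K, H): y marginal
    return tx, ty

def coord_distill_loss(logits_x, logits_y, teacher_heatmaps, T=2.0):
    """KL divergence between teacher-derived and student per-axis coordinate
    distributions. Assumes student bins equal the heatmap width/height;
    a real pipeline would resample the targets if they differ."""
    tx, ty = heatmap_to_axis_targets(teacher_heatmaps, T)
    lx = F.kl_div(F.log_softmax(logits_x / T, dim=-1), tx, reduction="batchmean")
    ly = F.kl_div(F.log_softmax(logits_y / T, dim=-1), ty, reduction="batchmean")
    return (lx + ly) * T * T  # rescale gradients as in standard KD
```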

Architecture Overview


Comparative Results


Qualitative Results

Global Semantics

Video Results

Poster BMVC 2024