Using a DCT-driven Loss in Attention-based Knowledge-Distillation for Scene Recognition

Alejandro López-Cifuentes, Marcos Escudero-Viñolo, Jesús Bescós, and Juan C. SanMiguel

Video Processing and Understanding Lab (VPU Lab), Universidad Autónoma de Madrid

Official project site for "Using a DCT-driven Loss in Attention-based Knowledge-Distillation for Scene Recognition" (under review in Elsevier Pattern Recognition).

Abstract

Knowledge Distillation (KD) is a strategy that defines a set of transferability gangways to improve the efficiency of Convolutional Neural Networks. Feature-based Knowledge Distillation is a subfield of KD that relies on intermediate network representations, either unaltered or depth-reduced via maximum activation maps, as the source knowledge.

In this paper, we propose and analyse the use of a 2D frequency transform of the activation maps before transferring them. We posit that, by using global image cues rather than pixel estimates, this strategy enhances knowledge transferability in tasks such as scene recognition, which is defined by strong spatial and contextual relationships between multiple and varied concepts.

To validate the proposed method, a novel and extensive evaluation of the state of the art in scene recognition is presented. Experimental results provide strong evidence that the proposed strategy enables the student network to better focus on the relevant image areas learnt by the teacher network, hence leading to more descriptive features and higher transferred performance than every other state-of-the-art alternative.

Proposed Method

Example of the proposed Knowledge-Distillation gangways between two ResNet architectures representing the teacher and the student models. In this case, the intermediate feature representations used for Knowledge Distillation are extracted from the basic Residual Blocks. We propose a novel matching approach that applies a 2D discrete linear transform to the activation maps. This technique, for which we leverage the simple yet effective Discrete Cosine Transform (DCT), allows comparing the 2D relationships captured by the transformed coefficients. In the proposed approach, the matching is moved from a pixel-to-pixel fashion to a correlation in the frequency domain, where each coefficient integrates spatial information from the whole image.
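A minimal sketch of this DCT-driven matching in PyTorch is shown below. It is not the authors' released implementation: the squared-channel-mean depth reduction, the orthonormal DCT-II basis built from matrix products, the bilinear resizing of mismatched stages and the mean-squared distance between normalised coefficients are assumptions made for illustration only.

    import math
    import torch
    import torch.nn.functional as F

    def dct_matrix(n: int, device=None, dtype=torch.float32) -> torch.Tensor:
        # Orthonormal DCT-II basis as an (n, n) matrix.
        k = torch.arange(n, device=device, dtype=dtype).unsqueeze(1)  # frequency index
        i = torch.arange(n, device=device, dtype=dtype).unsqueeze(0)  # spatial index
        basis = torch.cos(math.pi * (2 * i + 1) * k / (2 * n))
        basis[0] *= 1.0 / math.sqrt(2.0)
        return basis * math.sqrt(2.0 / n)

    def dct_2d(x: torch.Tensor) -> torch.Tensor:
        # 2D DCT of a batch of single-channel maps with shape (B, H, W).
        _, H, W = x.shape
        return dct_matrix(H, x.device, x.dtype) @ x @ dct_matrix(W, x.device, x.dtype).t()

    def attention_map(feats: torch.Tensor) -> torch.Tensor:
        # Depth-reduce a (B, C, H, W) feature tensor into a (B, H, W) activation map.
        return feats.pow(2).mean(dim=1)

    def dct_kd_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # Match student and teacher activation maps in the frequency domain.
        s_map, t_map = attention_map(student_feats), attention_map(teacher_feats)
        if s_map.shape[-2:] != t_map.shape[-2:]:  # align resolutions if stages differ (assumption)
            s_map = F.interpolate(s_map.unsqueeze(1), size=t_map.shape[-2:],
                                  mode="bilinear", align_corners=False).squeeze(1)
        # Each DCT coefficient aggregates spatial information from the whole map.
        s_dct = F.normalize(dct_2d(s_map).flatten(1), dim=1)
        t_dct = F.normalize(dct_2d(t_map).flatten(1), dim=1)
        return F.mse_loss(s_dct, t_dct)

Because the DCT is implemented as plain matrix products, the loss stays differentiable and can simply be added to the student's classification loss during training.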

Results

State-of-the-art Results

The results provide strong evidence that the proposed DCT-based metric enables the student network to better focus on the relevant image areas learnt by the teacher model, hence increasing the overall performance for Scene Recognition.

Qualitative Activation Maps Results

Example of the obtained activation maps at different levels of depth. The top two rows show activation maps for the vanilla ResNet-18 and ResNet-50 CNNs, respectively. The bottom row shows the activation maps obtained by the proposed DCT Attention-based KD method when ResNet-50 acts as the teacher network and ResNet-18 acts as the student. AT activation maps are also included for comparison.
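For reference, a short sketch of how such activation maps can be extracted for visualisation is given below; hooking layer3 of a torchvision ResNet-18 and using a squared-mean channel reduction are illustrative assumptions, not necessarily the exact settings behind the figure.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()
    feats = {}
    model.layer3.register_forward_hook(lambda m, i, o: feats.update(act=o))  # grab intermediate features

    image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed scene image
    with torch.no_grad():
        model(image)

    amap = feats["act"].pow(2).mean(dim=1, keepdim=True)           # (1, 1, h, w) activation map
    amap = F.interpolate(amap, size=image.shape[-2:], mode="bilinear", align_corners=False)
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)  # normalise to [0, 1] for overlay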

Citation

If you find this work useful, please consider citing:

López-Cifuentes, A., Escudero-Viñolo, M., Bescós, J., & San Miguel, J. C. (2022). Using a DCT-driven Loss in Attention-based Knowledge-Distillation for Scene Recognition.

 
  @InProceedings{Lopez2022using,
      author="L{\'o}pez-Cifuentes, Alejandro and Escudero-Vi{\~{n}}olo, Marcos and Besc{\'o}s, Jes{\'u}s and San Miguel, Juan Carlos",
      title="Using a DCT-driven Loss in Attention-based Knowledge-Distillation for Scene Recognition",
  }

Acknowledgement: This study has been partially supported by the Spanish Government through its TEC2017-88169-R MobiNetVideo project.