Min Li - Authorea

Pyramid Temporal Hierarchy Network (PTH-Net) is a new paradigm for dynamic facial expression recognition that is applied directly to raw videos, without face detection and alignment (FDA). The traditional paradigm first uses FDA to extract facial regions from raw videos and then performs recognition. Its advantage is that it minimizes the impact of complex backgrounds. However, it discards valuable information, such as body movements, and its dependence on FDA sacrifices flexibility. In contrast, PTH-Net distinguishes background from target at the feature level, preserves more critical information, and, as an end-to-end network, is more flexible. Specifically, PTH-Net uses a pre-trained backbone to extract generic video-understanding features at several temporal frequencies, which together form pyramid features. Subsequently, through temporal hierarchy refinement (achieved via differential sharing and downsampling), PTH-Net refines key information under the supervision of multiple receptive fields, with the temporal-frequency invariance of expressions. In addition, because videos contain numerous irrelevant frames, PTH-Net incorporates a Temporal Hierarchy Refinement layer that aggregates information at different temporal granularities, enhancing its ability to distinguish target from non-target expressions. Notably, PTH-Net achieves a more comprehensive and in-depth understanding by merging knowledge from both forward and reverse video sequences. PTH-Net excels across six challenging benchmarks at lower computational cost than preceding methods.
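
To make the ideas of pyramid features, temporal downsampling, and forward/reverse merging more concrete, the following is a minimal illustrative sketch in plain Python. It is not the PTH-Net implementation: the average-pooling operator, the stride of 2, the simple averaging of forward and reverse outputs, and all function names (`temporal_downsample`, `build_pyramid`, `bidirectional_scores`) are assumptions made for illustration, and differential sharing is not reproduced here.

```python
from typing import Callable, List

Frame = List[float]        # one feature vector per frame (hypothetical layout)
Sequence = List[Frame]     # per-frame features ordered in time


def temporal_downsample(seq: Sequence, stride: int = 2) -> Sequence:
    """Average-pool non-overlapping windows of `stride` frames.

    Average pooling is an assumed operator, chosen only for simplicity.
    """
    pooled = []
    for start in range(0, len(seq), stride):
        window = seq[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def build_pyramid(seq: Sequence, levels: int = 3) -> List[Sequence]:
    """Return up to `levels` versions of `seq`, each coarser in time than the last."""
    pyramid = [seq]
    for _ in range(levels - 1):
        if len(pyramid[-1]) < 2:
            break  # nothing left to pool
        pyramid.append(temporal_downsample(pyramid[-1]))
    return pyramid


def bidirectional_scores(
    score_fn: Callable[[Sequence], List[float]], seq: Sequence
) -> List[float]:
    """Average the scores obtained from the forward and the time-reversed sequence.

    `score_fn` is a hypothetical stand-in for any model mapping a sequence to
    class scores; plain averaging is an assumed way of merging the two passes.
    """
    forward = score_fn(seq)
    backward = score_fn(seq[::-1])
    return [(f + b) / 2 for f, b in zip(forward, backward)]


if __name__ == "__main__":
    # Eight frames of toy 2-dimensional features.
    features = [[float(t), 1.0] for t in range(8)]
    pyramid = build_pyramid(features, levels=3)
    print([len(level) for level in pyramid])
```

Under these assumptions, each pyramid level halves the temporal resolution of the previous one, so coarser levels summarize longer spans of the video, while the bidirectional helper treats the reversed clip as a second view of the same expression.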