HCMT: A Novel Hierarchical Cross-Modal Transformer for Recognition of Abnormal Behavior
Document Type
Article
Publication Date
1-1-2024
Abstract
Enhancing video recognition systems with abnormal behavior recognition is crucial for school safety and campus security. Traditional methods rely primarily on visual data and often fail to recognize complex behaviors against intricate backgrounds. Traditional audio processing techniques, in turn, have limited capacity to handle complex sounds and struggle to capture transient anomalies. This study addresses these limitations by integrating audio and visual data, compensating for the shortcomings of visual-only approaches in recognizing subtle behaviors. It introduces a novel Hierarchical Cross-Modal Transformer (HCMT), which combines multiple hierarchical visual and audio branches. This hierarchical integration lets HCMT capture low-level features that are often overlooked by methods using a single late-stage fusion, and thus learn global features more effectively. The audio branch uses the newly developed Audio Temporal Spectrogram Transformer (ATST), which employs a global sparse uniform sampling technique to capture the transient nature of audio-based abnormalities, thereby enhancing the robustness of behavior recognition. The HCMT model achieved a Top-1 accuracy of 79.45% and a Top-5 accuracy of 98.44% on the challenging Campus Abnormal Behavior Recognition Hard (CABRH8) dataset, which consists of eight hard-to-distinguish human abnormal behaviors. The ATST improved Top-1 accuracy by 7.45% over visual benchmarks alone. Furthermore, HCMT recorded Top-1 and Top-5 accuracies of 84.93% and 97.63% on the CABR50 dataset, outperforming prior models that relied solely on visual data, which underscores the adaptability of the approach. The model requires 992 GFLOPs and runs at 28 frames per second (FPS). Its generalizability was also confirmed on additional datasets, including UCF-101, on which it achieved advanced results.
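The abstract names a global sparse uniform sampling technique but does not spell out its details. As a rough illustration only, the sketch below shows one common reading of the idea: split the whole sequence of spectrogram frames into equal spans and take one frame from the middle of each, so that a few samples still cover the entire clip. The function name, its parameters, and the choice of the span centre are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def sparse_uniform_indices(num_frames: int, num_segments: int) -> List[int]:
    """Return one frame index per equal-length span of the sequence.

    Hypothetical helper: the centre-of-span rule is an assumption for
    illustration, not the sampling rule used by ATST.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    # Width of each span, possibly fractional when the lengths do not divide.
    span = num_frames / num_segments
    # Take the centre of each span and clamp to the last valid index.
    return [min(num_frames - 1, int(span * (i + 0.5))) for i in range(num_segments)]


if __name__ == "__main__":
    # Four samples spread across a 100-frame clip.
    print(sparse_uniform_indices(100, 4))
```

Because the indices are spread over the full length rather than drawn from one dense window, a short event anywhere in the clip has a chance of falling near a sampled frame, which is the property the abstract associates with capturing transient audio abnormalities.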
Keywords
Visualization, Feature extraction, Transformers, Accuracy, Spatiotemporal phenomena, Adaptation models, Transient analysis, Computational modeling, Computational complexity, Behavioral sciences, Video recognition systems, campus abnormal behaviors, visual and audio branches, hierarchical cross-modal transformer, audio temporal spectrogram transformer
Divisions
sch_ecs
Funders
Chongqing Key Laboratory of Public Big Data Security Technology, Universiti Malaya (ST018-2023), Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN202304017)
Publication Title
IEEE Access
Volume
12
Publisher
Institute of Electrical and Electronics Engineers
Publisher Location
445 HOES LANE, PISCATAWAY, NJ 08855-4141 USA