Position and Orientation Aware One-Shot Learning for Medical Action Recognition from Signal Data

A position and orientation-aware one-shot learning framework for medical action recognition from signal data

Abstract

In this work, we propose a position and orientation-aware one-shot learning framework for medical action recognition from signal data. The proposed framework comprises two stages, and each stage includes signal-level image generation (SIG), cross-attention (CsA), and dynamic time warping (DTW) modules, together with information fusion between the proposed privacy-preserved position and orientation features. The proposed SIG method transforms the raw skeleton data into privacy-preserved features for training. The CsA module is developed to guide the network in reducing medical action recognition bias and focusing more on the human body parts that are important for each specific action, thereby addressing issues caused by similar medical actions. Moreover, the DTW module is employed to minimize temporal mismatching between instances and further improve model performance. Furthermore, the proposed privacy-preserved orientation-level features are utilized to assist the position-level features in both stages to enhance medical action recognition performance. Extensive experimental results on the widely used NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate the effectiveness of the proposed method, which outperforms other state-of-the-art methods under general dataset partitioning by 2.7%, 6.2%, and 4.1%, respectively.

Figure 6: The single-stream illustration of the proposed one-shot learning approach. (1) The SIG module first transforms the input skeleton sequences into signal-level images, which are then fed into the ResNet18 encoder with the CsA module for feature extraction. (2) The CsA module enables information communication between the support and query sets, which allows the query instance to know which parts are important to the support instance, and vice versa. (3) The DTW module is exploited to address the temporal information mismatching issue via equations (6) and (7). The encoded features from both the support and query sets are used for metric learning in the ProtoNet framework [42]. The vectors from the support and query sets are finally mapped to the feature space for similarity calculation to yield the final results.
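
As a rough illustration of the last two steps, the Python sketch below computes a classic dynamic-time-warping alignment cost between a support and a query feature sequence and then scores a query embedding against class prototypes in a ProtoNet-like manner. The function names, the plain Euclidean ground cost, and the feature shapes are illustrative assumptions; the paper's actual equations (6) and (7) are not reproduced here.

    import numpy as np

    def dtw_cost(support_seq, query_seq):
        # Classic dynamic-time-warping alignment between two feature
        # sequences of shape (T1, D) and (T2, D); an illustrative stand-in
        # for the paper's DTW module (equations (6)-(7) are not reproduced).
        t1, t2 = len(support_seq), len(query_seq)
        acc = np.full((t1 + 1, t2 + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, t1 + 1):
            for j in range(1, t2 + 1):
                step = np.linalg.norm(support_seq[i - 1] - query_seq[j - 1])
                acc[i, j] = step + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
        return acc[t1, t2]

    def protonet_probabilities(support_feats, support_labels, query_feat):
        # ProtoNet-style scoring: average the support embeddings of each
        # class into a prototype, then softmax over negative distances.
        classes = sorted(set(support_labels))
        scores = np.array([
            -np.linalg.norm(query_feat - np.mean(
                [f for f, y in zip(support_feats, support_labels) if y == c], axis=0))
            for c in classes
        ])
        probs = np.exp(scores - scores.max())
        return classes, probs / probs.sum()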

Proposed Method

In our proposed POA-OSL, the position and orientation features are first extracted from the raw skeleton sequences for a better representation of human actions. The SIG module is then utilized to obtain privacy-preserved representations. After that, the CsA and DTW modules are applied to address the similar-action and temporal-mismatching issues, respectively. Subsequently, the orientation feature-assisted training method, which consists of multi-level fusion, is introduced. Finally, the ProtoNet-based model is selected for obtaining the experimental results.
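
As a minimal sketch of the first step, assuming a (T, J, 3) skeleton array and a hypothetical subset of bone pairs (the paper manually designs the bones from the skeleton landmarks), position-level and orientation-level features could be derived roughly as follows.

    import numpy as np

    # Hypothetical parent -> child joint pairs; the paper manually designs
    # the bones from the skeleton landmarks (e.g. the NTU RGB+D layout).
    BONE_PAIRS = [(0, 1), (1, 20), (20, 2), (2, 3)]  # illustrative subset only

    def position_and_orientation_features(skeleton):
        # skeleton: (T, J, 3) array of joint coordinates over T frames.
        # Position-level features are the joint coordinates themselves;
        # orientation-level features are unit bone directions plus the
        # bone angles, approximating the paper's orientation features.
        position = skeleton
        bones = np.stack(
            [skeleton[:, c] - skeleton[:, p] for p, c in BONE_PAIRS], axis=1)  # (T, B, 3)
        unit = bones / (np.linalg.norm(bones, axis=-1, keepdims=True) + 1e-8)
        vertical = np.array([0.0, 1.0, 0.0])
        angles = np.arccos(np.clip(unit @ vertical, -1.0, 1.0))  # (T, B) bone angles
        orientation = np.concatenate([unit.reshape(len(skeleton), -1), angles], axis=-1)
        return position, orientation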

  1. Signal Image Transformation (Stage 1): Based on the human physical structure, we manually design the human bones from the landmarks in the raw skeleton sequences. To obtain different types of human action features from the raw skeleton sequences, the angles of the human bones are extracted as orientation features to assist the training process.
  2. Cross Attention Mechanism (Stage 2): After transforming the skeleton sequence into the signal-level representation, a cross-attention module between the support set and query set is exploited in the proposed framework. Previous cross-attention approaches typically focus on only one of the two branches involved in the computation. The aim of cross-attention is to guide the network to give more attention to the important parts rather than the others, which reduces the difficulty of discriminating between similar medical actions (see the sketch after this list).
  3. Dynamic Time Warping (Stage 3): Several factors (e.g., different experimental subjects, speed, duration of the recording, and action timing) result in the temporal information mismatching issue between the support set and query set actions. The DTW module aligns the support and query sequences to minimize this mismatch (a classic DTW sketch is given after the Figure 6 caption above).
  4. Orientation-level Feature Assisted Training (Stage 3): Since the raw data consist only of coordinate positions of human landmarks, relying solely on a single-level feature for action recognition is insufficient to capture comprehensive action characteristics. Therefore, we propose an orientation-level assisted training approach to enhance model performance in different stages, which primarily consists of both early fusion and late fusion methods (a minimal fusion sketch follows this list).
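
The cross-attention step in item 2 can be illustrated with standard scaled dot-product attention, where queries come from one branch and keys/values from the other, so that each branch highlights the body parts relevant to its counterpart. The single-head PyTorch module below is a minimal sketch under assumed feature shapes, not the paper's exact CsA design.

    import torch
    import torch.nn as nn

    class CrossAttention(nn.Module):
        # Single-head scaled dot-product cross-attention: queries come from
        # one branch, keys/values from the other, so each branch can attend
        # to the parts that matter to its counterpart (dimensions assumed).
        def __init__(self, dim):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)
            self.k_proj = nn.Linear(dim, dim)
            self.v_proj = nn.Linear(dim, dim)
            self.scale = dim ** -0.5

        def forward(self, x_query, x_support):
            # x_query: (B, Nq, D) query-branch tokens; x_support: (B, Ns, D) support-branch tokens.
            q, k, v = self.q_proj(x_query), self.k_proj(x_support), self.v_proj(x_support)
            attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, Nq, Ns)
            return x_query + attn @ v  # residual keeps the original features

    # Example: attend query-branch feature tokens to support-branch tokens.
    q_feats, s_feats = torch.randn(1, 49, 256), torch.randn(1, 49, 256)
    attended = CrossAttention(256)(q_feats, s_feats)  # (1, 49, 256)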
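
For the orientation-assisted training in item 4, the snippet below sketches the two generic fusion options, assuming pre-computed position and orientation representations: early fusion concatenates the features before the encoder, while late fusion combines the per-class scores of the two streams. The weighting factor and shapes are assumptions, not the paper's exact fusion scheme.

    import torch

    def early_fusion(position_feat, orientation_feat):
        # Early fusion: concatenate the two feature types along the channel
        # dimension before they enter the shared encoder (shapes assumed).
        return torch.cat([position_feat, orientation_feat], dim=-1)

    def late_fusion(position_scores, orientation_scores, alpha=0.5):
        # Late fusion: combine per-class similarity scores from the two
        # streams with an assumed weighting factor alpha.
        return alpha * position_scores + (1.0 - alpha) * orientation_scores

    # Example with dummy tensors: 5-way scores from each stream.
    fused_scores = late_fusion(torch.randn(1, 5), torch.randn(1, 5))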

Results

In this section, we conduct experiments on three public datasets that are widely used and well known for action recognition tasks: NTU-60, NTU-120, and PKU-MMD. Quantitative results are presented to compare with other SOTA one-shot learning methods for human action recognition. Moreover, we design experiments for analysing specific medical actions as well as for result visualisation. Furthermore, we carry out ablation studies to demonstrate the effectiveness of the transformed features and the proposed CsA and DTW modules. Finally, experiments with different parameter settings are presented.

Table V
Table V: Performance of the proposed POA-OSL on specific classes of the NTU-60, NTU-120, and PKU-MMD datasets for 5-way-1-shot medical action recognition (Top-1 accuracy, %). w/o CsA indicates that the model contains only the DTW module; w/ CsA indicates that the model contains both the DTW and CsA modules.

As we can see from Table V, cough and chest pain are the most difficult medical actions to recognize. The reason is that the movements of these two actions are subtle in both the spatial and temporal dimensions. In contrast, staggering and falling achieve the second-highest and highest accuracy on the NTU-120 dataset, at 95.7% and 99.5%, respectively. It can also be observed that the performance on headache from PKU-MMD is remarkably improved from 50.1% to 69.7% by applying the proposed POA-OSL (MF), which further verifies that POA-OSL can enhance discriminating ability across different datasets.


Contributors

Leiyu Xie
Yuxing Yang
Zeyu Fu
Syed Mohsen Naqvi
@ARTICLE{10814994,
  author={Xie, Leiyu and Yang, Yuxing and Fu, Zeyu and Naqvi, Syed Mohsen},
  journal={IEEE Transactions on Multimedia},
  title={Position and Orientation Aware One-Shot Learning for Medical Action Recognition from Signal Data},
  year={2024},
  volume={},
  number={},
  pages={1-14},
  keywords={Feature extraction;Biomedical imaging;Skeleton;One shot learning;Training;Human activity recognition;Privacy;Data models;Protection;Data privacy;One-shot learning;medical action recognition;attention mechanism;feature fusion;healthcare},
  doi={10.1109/TMM.2024.3521703}}
        

© 2025 Multimodal Intelligence Lab, UK | Department of Computer Science, University of Exeter