Cross-modal learning representation using new margin combination for speech recognition task
Abstract
Cross-modal retrieval seeks to fuse information from multiple modalities in a way that mimics human learning. The central challenge in cross-modal matching is building a shared subspace that reflects semantic proximity, and previous works that adopt symmetric similarity computations fail to capture asymmetric relevance. To overcome these shortcomings, an efficient approach called quaternion representation learning (QRL) is introduced for cross-modal matching: the richer representational capacity and expressive power of the quaternionic space yield a better encoding of the shared semantics. Transfer learning is also central to this work; by leveraging pre-trained models, knowledge gained from one task or domain is transferred to another, improving performance and generalization. Specifically, a pre-trained ResNet-512 model is used together with the proposed total margin (TM) loss function, which combines the QRL approach with a novel adaptive mean margin (AMM). The TM loss, coupled with the pre-trained ResNet-512 model, is evaluated on the Audio-Visual Arabic Speech Database (AVAS) and the Arabic Visual Speech Database (AVSD), along with other audio-visual datasets, and consistently improves performance on both databases. On the AVAS database the recall (R@k) and mean average precision (mAP) scores are R@1: 42.1±0.7, R@2: 70.2±0.1, R@5: 78.5±1.0, and mAP: 53.0±1.1; on the AVSD database they are R@1: 41.7±0.3, R@2: 69.2±1.1, R@5: 78.0±0.3, and mAP: 52.7±0.5. By incorporating transfer learning and the TM loss into the cross-modal retrieval framework, this study shows how clustering efficiency and visual and speech understanding can be improved, and the combination of pre-trained models with the TM loss offers a promising avenue for advancing cross-modal matching and achieving state-of-the-art performance.
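The abstract does not give the exact form of the TM loss, so the following is a minimal sketch only: it assumes TM behaves like a bidirectional hinge ranking loss whose margin is shifted by the batch-mean similarity of the non-matching pairs (an "adaptive mean margin"), applied to audio and visual embeddings already produced by the encoders (e.g., quaternion-valued QRL features). The function name total_margin_loss and all parameters are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def total_margin_loss(audio_emb, visual_emb, base_margin=0.2):
    """Sketch of a total-margin-style loss: a bidirectional hinge ranking loss
    whose margin is the base margin plus the mean similarity of the
    non-matching (negative) pairs in the batch (assumed AMM behaviour)."""
    a = F.normalize(audio_emb, dim=-1)           # (B, D) audio embeddings
    v = F.normalize(visual_emb, dim=-1)          # (B, D) visual embeddings
    sim = a @ v.t()                              # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                # matched pairs on the diagonal
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(eye, 0.0)              # zero out the positives
    # Adaptive mean margin: shift the base margin by the mean negative similarity.
    margin = base_margin + neg.sum() / (neg.numel() - sim.size(0))
    cost_a2v = (margin + neg - pos).clamp(min=0)      # audio query -> visual
    cost_v2a = (margin + neg - pos.t()).clamp(min=0)  # visual query -> audio
    return (cost_a2v.masked_fill(eye, 0.0).mean()
            + cost_v2a.masked_fill(eye, 0.0).mean())


# Toy usage: random embeddings stand in for features from the pre-trained backbone.
audio = torch.randn(8, 512)
visual = torch.randn(8, 512)
print(total_margin_loss(audio, visual))
```

Under these assumptions, the margin grows when the negatives in a batch are on average more similar to the queries, which tightens the ranking constraint exactly where confusable pairs occur; the reported R@k and mAP figures above are the paper's measurements, not outputs of this sketch.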
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.