Cross-modal learning representation using new margin combination for speech recognition task

D. Karim; M. Abdelkarim

doi:10.22201/icat.24486736e.2024.22.3.2417


Published: Jun 28, 2024


Keywords:

TM, cross-modal representation, quaternion approach, AVAS, AVSD

D. Karim

Research Unit of Analysis and Processing of Electrical and Energetic Systems, Faculty of Sciences of Tunis, University of Tunis El Manar, El Manar-Tunis 2092, Tunisia

https://orcid.org/0000-0002-2644-9776

M. Abdelkarim

Research Laboratory in Algebra, Numbers Theory and Intelligent Systems, Faculty of Sciences of Monastir, 90 Mohamed V Street, 1002 Monastir, Tunisia

https://orcid.org/0009-0008-9629-4002

Abstract

Cross-modal retrieval studies how information from different modalities can be fused in a way that mimics human learning. The main challenge in cross-modal matching is to build a shared subspace that reflects semantic proximity. Previous works adopt symmetric similarity computations and therefore fail to capture asymmetric relevance. To overcome this shortcoming, an approach called quaternion representation learning (QRL) is introduced for better cross-modal matching. The richer representation capacity and strong expressive power of the quaternionic space yield a better representation of the shared semantics. Transfer learning is also crucial in this context: by leveraging pre-trained models, the knowledge gained from one task or domain can be transferred to another, improving performance and generalization. In this study, transfer learning is employed to enhance the cross-modal retrieval system. Specifically, a pre-trained ResNet-512 model is used in conjunction with the proposed total margin (TM) loss function, which combines the QRL approach with the novel adaptive mean margin (AMM) methodology. The TM loss function, coupled with the pre-trained ResNet-512 model, is evaluated on the Audio-Visual Arabic Speech Database (AVAS) and the Arabic Visual Speech Database (AVSD), along with other audio-visual datasets. Experimental results show that the TM loss function consistently improves performance on both databases. The recall scores (R@k) and mean average precision (mAP) achieved on the AVAS Database are R@1: 42.1±0.7, R@2: 70.2±0.1, R@5: 78.5±1.0, and mAP: 53.0±1.1. Similarly, on the AVSD Database, the results are R@1: 41.7±0.3, R@2: 69.2±1.1, R@5: 78.0±0.3, and mAP: 52.7±0.5.
By incorporating transfer learning and the TM loss function into the cross-modal retrieval framework, this study demonstrates the potential for improving clustering efficiency and enhancing visual and speech understanding. The combination of pre-trained models and the TM loss function offers a promising avenue for advancing cross-modal matching techniques and achieving state-of-the-art performance.
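
For readers unfamiliar with the retrieval metrics reported above, the following is a minimal Python sketch of how R@k and mAP are commonly computed. It is not taken from the article. The function names, the toy data, and the choice of the "at least one relevant item in the top k" convention for R@k are assumptions made for illustration, since the abstract does not state which convention the authors use.

```python
from typing import Dict, List, Sequence, Set, Tuple

# One query = (ranked list of retrieved item ids, set of relevant item ids).
Query = Tuple[Sequence[str], Set[str]]


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Hit-based R@k (assumed convention): 1.0 if any relevant item is in the top k."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Mean of precision values taken at each rank where a relevant item occurs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def evaluate(queries: List[Query], ks: Sequence[int] = (1, 2, 5)) -> Dict[str, float]:
    """Average R@k (for each k) and average precision over all queries."""
    n = len(queries)
    scores = {
        f"R@{k}": sum(recall_at_k(ranked, rel, k) for ranked, rel in queries) / n
        for k in ks
    }
    scores["mAP"] = sum(average_precision(ranked, rel) for ranked, rel in queries) / n
    return scores


# Hypothetical toy data: two queries, each with a single relevant item.
toy_queries: List[Query] = [
    (["a", "b", "c"], {"a"}),  # relevant item ranked first
    (["x", "y", "z"], {"y"}),  # relevant item ranked second
]

if __name__ == "__main__":
    print(evaluate(toy_queries))
```

On the toy data, the first query is a hit at rank 1 and the second at rank 2, so R@1 is 0.5, R@2 and R@5 are 1.0, and mAP is 0.75. The reported values in the abstract are percentages with a spread over runs, which would correspond to multiplying such averages by 100 and repeating the evaluation.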

How to Cite

Karim, D., & Abdelkarim, M. (2024). Cross-modal learning representation using new margin combination for speech recognition task. Journal of Applied Research and Technology, 22(3), 451–470. https://doi.org/10.22201/icat.24486736e.2024.22.3.2417

Issue

Vol. 22 No. 3 (2024)

Section

Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Author Biography

M. Abdelkarim, Research Laboratory in Algebra, Numbers Theory and Intelligent Systems, Faculty of Sciences of Monastir, 90 Mohamed V Street, 1002 Monastir, Tunisia

Research Laboratory in Algebra, Numbers Theory and Intelligent Systems
