A Survey of Speech and Text-Based Emotion Recognition
DOI: https://doi.org/10.62381/ACS.SSFS2025.05
Author(s)
Zhe Liu, Ruiyu Liu
Affiliation(s)
Maynooth International College, Fuzhou University, Fuzhou, Fujian, China
Abstract
Emotion recognition is a core technology for human-like interaction in artificial intelligence. Traditional single-modal approaches, which rely solely on speech or text, suffer from incomplete feature representation and sensitivity to environmental noise. Multimodal sentiment analysis, which integrates speech prosody (e.g., pitch and speech rate) with text semantics (e.g., emotional vocabulary and syntactic structure), significantly improves recognition accuracy and robustness. This paper systematically reviews research progress in this field.
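To make the fusion idea concrete, the sketch below shows feature-level (early) fusion of prosodic and textual cues: a feature vector is extracted per modality, the two are concatenated, and a classifier is trained on the fused representation. This is an illustrative toy, not any system surveyed here; the prosody proxies, the sentiment lexicon, and the logistic-regression classifier are all assumptions chosen for brevity (NumPy and scikit-learn only).

# Illustrative early-fusion pipeline for speech+text emotion recognition.
# All features below are toy stand-ins for the prosodic and lexical cues
# mentioned in the abstract; they are assumptions, not a surveyed method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prosodic_features(waveform, sr):
    """Crude prosody proxies: signal energy, zero-crossing rate, and a
    frame-activity ratio standing in for pitch/speech-rate features."""
    energy = float(np.mean(waveform ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(waveform))) > 0))
    # Split into ~100 ms frames and count those above average energy.
    frames = np.array_split(waveform, max(1, len(waveform) // (sr // 10)))
    active = sum(float(np.mean(f ** 2)) > energy for f in frames)
    return np.array([energy, zcr, active / len(frames)])

def text_features(text):
    """Toy lexicon counts standing in for emotional-vocabulary features."""
    positive = {"happy", "great", "love"}
    negative = {"sad", "angry", "terrible"}
    tokens = text.lower().split()
    return np.array([sum(t in positive for t in tokens),
                     sum(t in negative for t in tokens),
                     len(tokens)])

def fuse(waveform, sr, text):
    # Early fusion: concatenate the modality-specific feature vectors.
    return np.concatenate([prosodic_features(waveform, sr), text_features(text)])

# Synthetic data purely to show the pipeline shape end to end.
rng = np.random.default_rng(0)
texts = ["i am so happy", "this is terrible", "what a great day", "i feel sad"]
X = np.stack([fuse(rng.normal(size=16000), 16000, t) for t in texts])
y = np.array([1, 0, 1, 0])  # 1 = positive, 0 = negative
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))

In practice, the systems reviewed replace such hand-crafted features with learned representations (e.g., HuBERT speech embeddings [10] and BERT text embeddings [27]) and use richer fusion schemes such as tensor fusion [4] or low-rank fusion [17].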
Keywords
Emotion Analysis; Multimodal Sentiment Analysis; Deep Learning; Multimodal Fusion; Emotion Representation Model; Cross-Domain Generalization
References
[1] Zadeh, A., Zellers, R., Pincus, E., & Morency, L.-P. (2016). Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 31(6), 82-88. https://doi.org/10.1109/MIS.2016.94
[2] Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423-443. https://doi.org/10.1109/TPAMI.2018.2798607
[3] Tao, J. H., Fan, C. H., Lian, Z., et al. (2024). Current developments and trends in multimodal emotion recognition and understanding. Journal of Image and Graphics, 29(6), 1607-1627. https://doi.org/10.11834/jig.240017
[4] Zadeh, A., Chen, M., Poria, S., Cambria, E., & Morency, L.-P. (2017). Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1103-1114. Copenhagen, Denmark: Association for Computational Linguistics.
[5] Busso, C., Bulut, M., Lee, C. C., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335-359. https://doi.org/10.1007/s10579-008-9076-6
[6] Tao, J. H., Fan, C. H., Lian, Z., Lyu, Z., Shen, Y., & Liang, S. (2024). Development of multimodal sentiment recognition and understanding. Journal of Image and Graphics, 29(6), 1607-1627.
[7] Tassi, D., Marchi, E., & Zaccaria, R. (2015). Speech emotion recognition using MFCC features and HMM classifiers. Procedia Computer Science, 62, 1152-1159.
[8] Trigeorgis, G., Ringeval, F., Brueckner, R., Maciejewski, M., & Schuller, B. W. (2016). AdieuNet: Deep audio-visual emotion recognition with multiplicative LSTM integration. In 2016 16th IEEE International Conference on Data Mining Workshops (ICDMW), pp. 57-64.
[9] Wollmer, M., Eyben, F., Schuller, B. W., & Rigoll, G. (2013). Recurrent neural networks for robust speech emotion recognition. In 2013 21st European Signal Processing Conference (EUSIPCO), pp. 1-5.
[10] Hsu, W.-N., Tang, Y., Hsiao, Y., Chuang, Y.-T., Lee, S.-W., & Lin, S.-W. (2021). HuBERT: Self-supervised learning for spoken language understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5835-5846.
[11] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543.
[12] Hutto, C. J., & Gilbert, E. E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media texts. In Proceedings of the Eighth International Conference on Weblogs and Social Media (ICWSM), pp. 216-225.
[13] Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
[14] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[15] Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
[16] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NIPS 2017), pp. 5998-6008.
[17] Liu, Z., Shen, Y., Lakshminarasimhan, V. B., Liang, P. P., Zadeh, A. B., & Morency, L.-P. (2018). Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2247-2256. Melbourne, Australia: Association for Computational Linguistics.
[18] McKeown, G., Valstar, M. F., Cowie, R., & Pantic, M. (2012). SEMAINE: A multimodal database of emotionally coloured conversations. In 2012 5th International Workshop on Database and Expert Systems Applications (DEXA), pp. 432-436.
[19] Poria, S., Cambria, E., Hazarika, D., & Hussain, A. (2016). A review of deep learning techniques for multimodal sentiment analysis. IEEE Transactions on Affective Computing, 7(4), 377-393.
[20] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
[21] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2009). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61-80.
[22] Weston, J., Chopra, S., & Bordes, A. (2014). Memory networks. arXiv preprint arXiv:1410.3916.
[23] Yuan, Y., & Liberman, M. (2008). P2FA: A tool for forced alignment of speech and text. Language Resources and Evaluation, 42(4), 359-376.
[24] Wang, J., Xue, M., Culhane, R., Diao, E., Ding, J., & Tarokh, V. (2020). Speech emotion recognition with dual-sequence LSTM architecture. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp. 6474-6478.
[25] Pham, H. L., Nguyen, T. N., & Phung, D. Q. (2019). Multimodal context transformation network for sentiment analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 503-512.
[26] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS 2014), pp. 2672-2680.
[27] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers), pp. 4171-4186.
[28] Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pp. 1126-1135.
[29] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pp. 8748-8763.
[30] Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7), 1418-1432.
[31] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144.
[32] Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS 2017), pp. 4765-4774.
[33] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.