Enhancing text classification using hybrid embeddings and advanced machine learning techniques

Abstract

This study introduces a text classification system designed for both high predictive performance and real-world interpretability. It combines sparse TF-IDF features with dense semantic embeddings from FastText and OpenAI GPT-2, so that lexical features and contextual semantic information together improve accuracy and robustness. Unlike models that rely on handcrafted or transformer-based features, the proposed architecture merges several embedding channels and passes the fused representation to XGBoost, Random Forest and Logistic Regression classifiers. The system is evaluated on three datasets: SpamBase (spam vs. ham classification), Sentiment140 (sentiment polarity detection) and AG News (topic categorization). It achieves 98.9% accuracy and a 0.99 F1-score on SpamBase, 93.7% accuracy and a 0.93 F1-score on Sentiment140, and 94.6% accuracy and a 0.94 F1-score on AG News, indicating effective performance across domains. An ablation study shows that integrating TF-IDF with GPT-2 improves precision and recall by 4.6%, while FastText embeddings help keep performance stable on noisy or informal text. The framework is resistant to adversarial phrasing, slang and content sparsity, properties that are essential for real-world applications. Through embedding-level fusion and classifier diversity, the system outperforms single-representation baselines while remaining interpretable and modular. The results underline the value of uniting traditional NLP features with contemporary methods in an adaptable system that generalizes across datasets for industrial text mining applications.
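
The embedding-level fusion described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The dense channel here is a hashed character-trigram vector that merely stands in for FastText or GPT-2 embeddings, the nearest-centroid rule stands in for the XGBoost, Random Forest and Logistic Regression classifiers, and the toy documents, the dimension of 16 and all function names are assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf_vector(text, vocab, idf):
    """Sparse lexical channel: term count times IDF over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return l2_normalize([counts[tok] * idf[tok] for tok in vocab])

def dense_vector(text, dim=16, n=3):
    """Stand-in dense channel built from hashed character trigrams.
    This is NOT FastText or GPT-2; it only mimics a fixed-size dense vector."""
    vec = [0.0] * dim
    for tok in tokenize(text):
        padded = f"<{tok}>"
        for i in range(max(1, len(padded) - n + 1)):
            digest = hashlib.md5(padded[i:i + n].encode("utf-8")).digest()
            sign = 1.0 if digest[1] % 2 == 0 else -1.0
            vec[digest[0] % dim] += sign
    return l2_normalize(vec)

def fuse(text, vocab, idf):
    """Embedding-level fusion: concatenate the separately normalized channels."""
    return tfidf_vector(text, vocab, idf) + dense_vector(text)

def fit_centroids(docs, labels, vocab, idf):
    """Placeholder classifier: one mean fused vector per class."""
    sums, counts = {}, Counter(labels)
    for doc, label in zip(docs, labels):
        vec = fuse(doc, vocab, idf)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(text, centroids, vocab, idf):
    """Pick the class whose centroid has the largest dot product with the input."""
    vec = fuse(text, vocab, idf)
    return max(centroids,
               key=lambda lab: sum(a * b for a, b in zip(vec, centroids[lab])))

# Toy data, invented for this example only.
train_docs = [
    "win a free prize now",
    "free cash offer click now",
    "meeting moved to monday morning",
    "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

idf = fit_idf(train_docs)
vocab = sorted(idf)
fused = fuse("free prize offer", vocab, idf)
centroids = fit_centroids(train_docs, train_labels, vocab, idf)
prediction = predict("free prize now", centroids, vocab, idf)
print(len(vocab), len(fused), prediction)
```

Normalizing each channel before concatenation keeps either representation from dominating the fused vector by scale alone. In a real pipeline, the stand-in functions would be replaced by trained embedding models and the classifiers named in the abstract.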

Funding

None of the authors have received any funding.

Author information

Authors and Affiliations

Department of Computer Science and Engineering and Information Technology, Jaypee Institute of Information Technology, Noida, India
Abhishek Chaudhary & Shikha Jain

Authors

Abhishek Chaudhary
Shikha Jain

Corresponding author

Correspondence to Shikha Jain.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Human and/or animals participants

This article does not contain any studies involving animals and/or human participants.

Informed consent

No patient was involved in this study.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chaudhary, A., Jain, S. Enhancing text classification using hybrid embeddings and advanced machine learning techniques. Int J Syst Assur Eng Manag (2025). https://doi.org/10.1007/s13198-025-02950-x

Received: 23 April 2025
Accepted: 22 August 2025
Published: 15 September 2025
DOI: https://doi.org/10.1007/s13198-025-02950-x
