Abstract
This study introduces a text classification system designed for both high predictive performance and practical interpretability. It combines sparse TF-IDF features with dense semantic embeddings from FastText and OpenAI GPT-2, so that lexical features and contextual semantic information jointly improve accuracy and robustness. Unlike models that rely only on handcrafted features or only on transformer-based features, the proposed architecture merges several embedding channels and passes the fused representation to XGBoost, Random Forest, and Logistic Regression classifiers. The system is evaluated on three datasets: SpamBase (spam vs. ham classification), Sentiment140 (sentiment polarity detection), and AG News (topic categorization). It achieves 98.9% accuracy and a 0.99 F1-score on SpamBase, 93.7% accuracy and a 0.93 F1-score on Sentiment140, and 94.6% accuracy and a 0.94 F1-score on AG News, indicating effective performance across domains. The ablation study shows that combining TF-IDF with GPT-2 improves precision and recall by 4.6%, while FastText embeddings help keep the system stable on noisy or informal text. The framework is also resistant to adversarial phrasing, slang, and sparse content, all of which matter in real-world applications. Through embedding-level fusion and classifier diversity, the proposed system outperforms single-representation baselines while remaining interpretable and modular. The results highlight the value of uniting traditional NLP features with contemporary embedding methods in an adaptable system that generalizes across datasets for industrial text mining applications.
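The embedding-level fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the channels are combined, so this example assumes simple concatenation after per-channel L2 normalization. The function `toy_dense_embedding` is a hypothetical stand-in (hashed character trigrams) for the FastText and GPT-2 vectors, not the authors' implementation, and the downstream classifiers are omitted.

```python
import hashlib
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse lexical channel: smoothed TF-IDF over whitespace tokens."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] * idf[t] for t in vocab])
    return rows, vocab


def toy_dense_embedding(text, dim=8):
    """Hypothetical dense channel (assumption): averaged hashed character
    trigrams. A real system would use FastText or GPT-2 vectors here."""
    vec = [0.0] * dim
    padded = f"<{text.lower()}>"
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    for g in grams:
        h = int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16)
        sign = 1.0 if (h >> 8) % 2 == 0 else -1.0
        vec[h % dim] += sign
    count = max(len(grams), 1)
    return [v / count for v in vec]


def l2_normalize(vec):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse(docs, dim=8):
    """Concatenate the normalized sparse and dense channels per document."""
    sparse_rows, vocab = tfidf_vectors(docs)
    fused = []
    for doc, sparse in zip(docs, sparse_rows):
        dense = toy_dense_embedding(doc, dim)
        fused.append(l2_normalize(sparse) + l2_normalize(dense))
    return fused, vocab


docs = ["win a free prize now", "meeting moved to friday", "free prize inside"]
features, vocab = fuse(docs)
print(len(vocab), len(features[0]))
```

Normalizing each channel before concatenation keeps one representation from dominating the other by scale alone. The fused rows could then be passed to any of the classifiers named in the abstract.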




Funding
The authors received no funding for this work.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Human and/or animal participants
This article does not contain any studies involving human participants or animals.
Informed consent
No patient was involved in this study.
About this article
Cite this article
Chaudhary, A., Jain, S. Enhancing text classification using hybrid embeddings and advanced machine learning techniques. Int J Syst Assur Eng Manag (2025). https://doi.org/10.1007/s13198-025-02950-x