Enhancing text classification using hybrid embeddings and advanced machine learning techniques

Original Article
International Journal of System Assurance Engineering and Management

Abstract

This research introduces a text classification system designed for both high predictive performance and real-world interpretability. It combines sparse TF-IDF features with dense semantic embeddings from FastText and OpenAI GPT-2, so that lexical features and contextual semantic information are used together to improve accuracy and robustness. Unlike models built on handcrafted features alone or on transformer-based features alone, the proposed architecture merges several embedding channels and passes the fused representation to XGBoost, Random Forest and Logistic Regression classifiers. The system is evaluated on three datasets: SpamBase (spam vs. ham classification), Sentiment140 (sentiment polarity detection) and AG News (topic categorization). It achieves 98.9% accuracy with a 0.99 F1-score on SpamBase, 93.7% accuracy with a 0.93 F1-score on Sentiment140 and 94.6% accuracy with a 0.94 F1-score on AG News, indicating effective performance across domains. The ablation study shows that integrating TF-IDF with GPT-2 improves precision and recall by 4.6%, while FastText embeddings help maintain stability on noisy or informal text. The framework is resistant to adversarial phrasing, slang and content sparsity, all of which matter in real-world applications. Through embedding-level fusion and classifier diversity, the proposed system outperforms single-representation baselines while remaining interpretable and modular. The results demonstrate the value of uniting traditional NLP features with contemporary methods in an adaptable system that generalizes across datasets for industrial text mining applications.
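The embedding-level fusion described in the abstract can be illustrated with a small sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the dense channel is a hashed character-trigram vector standing in for the FastText and GPT-2 embeddings, a nearest-centroid rule stands in for the XGBoost, Random Forest and Logistic Regression classifiers, and the toy documents and all function names are hypothetical. Only the general idea (concatenate a sparse lexical vector with a dense vector, then classify the fused vector) is taken from the text.

```python
import math
import zlib
from collections import Counter

DENSE_DIM = 64  # assumed size of the stand-in dense channel

def tokenize(text):
    return text.lower().split()

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def fit_tfidf(docs):
    """Learn a vocabulary and smoothed IDF weights from training documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    vocab = sorted(df)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return vocab, idf

def tfidf_vector(text, vocab, idf):
    """Sparse lexical channel: L2-normalized TF-IDF over the vocabulary."""
    counts = Counter(tokenize(text))
    return l2_normalize([counts[t] * idf[t] for t in vocab])

def dense_vector(text, dim=DENSE_DIM, n=3):
    """Stand-in dense channel (NOT FastText or GPT-2): hashed character
    trigrams. crc32 is used because it is deterministic across runs."""
    vec = [0.0] * dim
    for tok in tokenize(text):
        padded = f"<{tok}>"
        for i in range(max(1, len(padded) - n + 1)):
            vec[zlib.crc32(padded[i:i + n].encode()) % dim] += 1.0
    return l2_normalize(vec)

def hybrid_vector(text, vocab, idf):
    """Embedding-level fusion: concatenate the two channels."""
    return tfidf_vector(text, vocab, idf) + dense_vector(text)

def fit_centroids(docs, labels, vocab, idf):
    """Stand-in classifier: one mean fused vector per class."""
    sums, counts = {}, Counter(labels)
    for doc, label in zip(docs, labels):
        vec = hybrid_vector(doc, vocab, idf)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(text, centroids, vocab, idf):
    """Return the label whose centroid has the largest dot product."""
    vec = hybrid_vector(text, vocab, idf)
    return max(centroids,
               key=lambda lab: sum(a * b for a, b in zip(vec, centroids[lab])))

# Hypothetical toy data, for illustration only.
train_docs = [
    "win free prize now",
    "free cash prize claim now",
    "meeting at noon tomorrow",
    "lunch meeting tomorrow at office",
]
train_labels = ["spam", "spam", "ham", "ham"]

vocab, idf = fit_tfidf(train_docs)
centroids = fit_centroids(train_docs, train_labels, vocab, idf)
print(predict("claim your free prize", centroids, vocab, idf))
print(predict("office meeting tomorrow", centroids, vocab, idf))
```

In a realistic setting the stand-in pieces would be swapped for pretrained embeddings and stronger classifiers; the sketch only shows where the concatenation happens and that each channel can be replaced independently, which is one way to read the modularity claim in the abstract.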

Funding

None of the authors have received any funding.

Author information

Corresponding author

Correspondence to Shikha Jain.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and/or animal participants

This article does not contain any studies involving human or animal participants.

Informed consent

No patient was involved in this study.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chaudhary, A., Jain, S. Enhancing text classification using hybrid embeddings and advanced machine learning techniques. Int J Syst Assur Eng Manag (2025). https://doi.org/10.1007/s13198-025-02950-x

  • DOI: https://doi.org/10.1007/s13198-025-02950-x
