Abstract
As misinformation and malicious behavior escalate on social media platforms, traditional detection-based measures often fail to address the problem in time. When a user who has, for example, disseminated false information switches between multiple accounts or keeps creating new ones, re-detecting that user’s presence becomes difficult. In this paper, we present a novel approach to understanding and characterizing authorship in social media using a model called \(PART_{SCL}\), an improvement on the previous PART model. \(PART_{SCL}\) generates “authorship embeddings”, numerical representations of an author’s writing style, allowing for more accurate and earlier detection of malicious behavior. Our main contributions are the \(PART_{SCL}\) model itself, a new pre-training approach for authorship attribution, and the application of our model to different datasets. These advances help bridge the gap between popular Natural Language Processing techniques such as Transformers and feature engineering, providing a robust tool for the ongoing fight against online misbehavior and misinformation.
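As a rough illustration of how authorship embeddings might help link a new account to a previously seen author, the sketch below compares embedding vectors by cosine similarity and picks the closest known account. The abstract does not state how embeddings are compared, so the metric, the toy 3-dimensional vectors, and the names `cosine_similarity`, `most_similar_author`, `known_authors`, and `new_account` are all assumptions made for illustration, not the \(PART_{SCL}\) implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_author(query, known):
    """Return (author, score) for the known embedding closest to `query`."""
    scored = ((author, cosine_similarity(query, emb)) for author, emb in known.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical embeddings: in practice these would come from a trained model
# and have far more dimensions. The values here are made up.
known_authors = {
    "account_a": [0.9, 0.1, 0.0],
    "account_b": [0.0, 0.2, 0.9],
}
new_account = [0.8, 0.2, 0.1]

author, score = most_similar_author(new_account, known_authors)
print(author, round(score, 3))
```

In a real setting the similarity score would typically be checked against a threshold chosen on validation data before treating two accounts as the same author; that threshold is likewise not given in the source.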
Notes
1. RoBERTa large Hugging Face checkpoint: https://huggingface.co/roberta-large.
Acknowledgments
This work is part of the project PCI2022-134990-2 (MARTINI) of the CHISTERA IV Cofund 2021 program, funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”, by the Spanish Ministry of Science and Innovation under FightDIS (PID2020-117263GB-100) grant and MCIN/AEI/10.13039/501100011033/ and European Union NextGeneration EU/PRTR for XAI-Disinfodemics (PLEC2021-007681) grant, by European Commission under IBERIFIER - Iberian Digital Media Research and Fact-Checking Hub (2020-EU-IA-0252), and by “Convenio Plurianual with the Universidad Politécnica de Madrid in the actuation line of Programa de Excelencia para el Profesorado Universitario”. This publication is also part of the I+D+i project PLEC2021-007681, financed by MCIN/AEI/10.13039/501100011033/ and the European Union NextGeneration/PRTR.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huertas-Tato, J., Martín, A., Camacho, D. (2023). Using Authorship Embeddings to Understand Writing Style in Social Media. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023. Lecture Notes in Computer Science, vol 14163. Springer, Cham. https://doi.org/10.1007/978-3-031-42448-9_6
DOI: https://doi.org/10.1007/978-3-031-42448-9_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42447-2
Online ISBN: 978-3-031-42448-9
eBook Packages: Computer Science, Computer Science (R0)