Using Authorship Embeddings to Understand Writing Style in Social Media

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14163))

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

1123 Accesses
2 Citations

Abstract

With the escalation of misinformation and malicious behavior issues on social media platforms, traditional detection-based measures often fail to address the problem in time. The use of multiple accounts or the continuous creation of new accounts makes it difficult to re-detect the presence of a user who, for example, has disseminated false information. In this paper, we present a novel approach to understanding and characterizing authorship in social media using a model called $PART_{SCL}$, an improvement of the previous PART model. $PART_{SCL}$ generates “authorship embeddings”, numerical representations of an author’s writing style, allowing for more accurate and earlier detection of malicious behavior. Our main contributions include the $PART_{SCL}$ model itself, a new pre-training approach for authorship attribution, and the application of our model on different datasets. These advances help bridge the gap between popular Natural Language Processing techniques such as Transformers and feature engineering, providing a robust tool for the ongoing fight against online misbehavior and misinformation.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+

from $39.99 /Month

Starting from 10 chapters or articles per month
Access and download chapters and articles from more than 300k books and 2,500 journals
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 79.99; Price excludes VAT (USA)

Softcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Authorship Verification with Personalized Language Models

Authorship Verification on Short Text Samples Using Stylometric Embeddings

Toward a new approach to author profiling based on the extraction of statistical features

Article 22 June 2021

Notes

1.
RoBERTa large huggingface checkpoint: https://huggingface.co/roberta-large.

References

Argamon, S., Juola, P.: Overview of the international authorship identification competition at pan-2011. In: CLEF (Notebook Papers/Labs/Workshop) (2011)
Google Scholar
Barlas, G., Stamatatos, E.: Cross-domain authorship attribution using pre-trained language models. In: Maglogiannis, I., Iliadis, L., Pimenidis, E. (eds.) AIAI 2020. IAICT, vol. 583, pp. 255–266. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49161-1_22
Chapter Google Scholar
Cheng, Z., Caverlee, J., Lee, K.: You are where you tweet: a content-based approach to geo-locating twitter users. In: Proceedings of the 19th ACM international conference on Information and knowledge management, pp. 759–768 (2010)
Google Scholar
Gerlach, M., Font-Clos, F.: A standardized project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics. Entropy 22(1), 126 (2020)
Article Google Scholar
Hu, Z., Lee, R.K.-W., Wang, L., Lim, E., Dai, B.: DeepStyle: user style embedding for authorship attribution of short texts. In: Wang, X., Zhang, R., Lee, Y.-K., Sun, L., Moon, Y.-S. (eds.) APWeb-WAIM 2020. LNCS, vol. 12318, pp. 221–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60290-1_17
Chapter Google Scholar
Huertas-Tato, J., Huertas-Garcia, A., Martin, A., Camacho, D.: PART: Pre-trained Authorship Representation Transformer. arXiv (2022). https://doi.org/10.48550/arXiv.2209.15373
Huertas-Tato, J., Martin, A., Huertas-Garcia, A., Camacho, D.: Generating authorship embeddings with transformers. In: 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2022)
Google Scholar
Juola, P.: An overview of the traditional authorship attribution subtask. In: CLEF (Online Working Notes/Labs/Workshop), vol. 1178, p. 1 (2012)
Google Scholar
Kestemont, M., Stamatatos, E., Manjavacas, E., Daelemans, W., Potthast, M., Stein, B.: Overview of the cross-domain authorship attribution task at $\{$PAN$\}$ 2019. In: Working Notes of CLEF 2019-Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9–12, 2019, pp. 1–15 (2019)
Google Scholar
Kestemont, M., et al.: Overview of the author identification task at pan-2018: cross-domain authorship attribution and style change detection. In: Working Notes Papers of the CLEF 2018 Evaluation Labs. Avignon, France, September 10–14, 2018/Cappellato, Linda [edit.] et al, pp. 1–25 (2018)
Google Scholar
Khosla, P., et al.: Supervised Contrastive Learning. arXiv (2020). 10.48550/arXiv.2004.11362
Google Scholar
Liu, Y., et al.: Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Manolache, A., Brad, F., Burceanu, E., Barbalau, A., Ionescu, R., Popescu, M.: Transferring Bert-like transformers’ knowledge for authorship verification. arXiv preprint arXiv:2112.05125 (2021)
Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blogging. In: AAAI Spring symposium: Computational Approaches to Analyzing Weblogs, vol. 6, pp. 199–205 (2006)
Google Scholar
Shetty, J., Adibi, J.: The Enron email dataset database schema and brief statistical report. Information sciences institute technical report, University of Southern California, vol. 4, no. 1, pp. 120–128 (2004)
Google Scholar
V"olske, M., Potthast, M., Syed, S., Stein, B.: TL;DR: mining reddit to learn automatic summarization. In: Proceedings of the Workshop on New Frontiers in Summarization, pp. 59–63. Association for Computational Linguistics, Copenhagen, Denmark (2017). https://doi.org/10.18653/v1/W17-4508, https://www.aclweb.org/anthology/W17-4508

Download references

Acknowledgments

This work is part of the project PCI2022-134990-2 (MARTINI) of the CHISTERA IV Cofund 2021 program, funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”, by the Spanish Ministry of Science and Innovation under FightDIS (PID2020-117263GB-100) grant and MCIN/AEI/10.13039/501100011033/ and European Union NextGeneration EU/PRTR for XAI-Disinfodemics (PLEC2021-007681) grant, by European Comission under IBERIFIER - Iberian Digital Media Research and Fact-Checking Hub (2020-EU-IA-0252), and by“Convenio Plurianual with the Universidad Politécnica de Madrid in the actuation line of Programa de Excelencia para el Profesorado Universitario”. This publication is also part of the I+D+i project PLEC2021-007681, financed by MCIN/AEI/10.13039/501100011033/ and the European Union NextGeneration/PRTR.

Author information

Authors and Affiliations

Department of Computer Systems Engineering, Universidad Politécnica de Madrid, Madrid, Spain
Javier Huertas-Tato, Alejandro Martín & David Camacho

Authors

Javier Huertas-Tato
View author publications
Search author on:PubMed Google Scholar
Alejandro Martín
View author publications
Search author on:PubMed Google Scholar
David Camacho
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Alejandro Martín .

Editor information

Editors and Affiliations

Democritus University of Thrace, Xanthi, Greece
Avi Arampatzis
University of Amsterdam, Amsterdam, The Netherlands
Evangelos Kanoulas
CERTH-ITI, Thessaloniki, Greece
Theodora Tsikrika
CERTH-ITI, Thessaloniki, Greece
Stefanos Vrochidis
Utrecht University, Utrecht, The Netherlands
Anastasia Giachanou
Elsevier, Amsterdam, The Netherlands
Dan Li
University of Amsterdam, Amsterdam, The Netherlands
Mohammad Aliannejadi
University of Lausanne, Lausanne, Switzerland
Michalis Vlachos
University of Padua, Padova, Italy
Guglielmo Faggioli
University of Padua, Padova, Italy
Nicola Ferro

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Huertas-Tato, J., Martín, A., Camacho, D. (2023). Using Authorship Embeddings to Understand Writing Style in Social Media. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023. Lecture Notes in Computer Science, vol 14163. Springer, Cham. https://doi.org/10.1007/978-3-031-42448-9_6

Download citation

DOI: https://doi.org/10.1007/978-3-031-42448-9_6
Published: 11 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42447-2
Online ISBN: 978-3-031-42448-9
eBook Packages: Computer ScienceComputer Science (R0)

Keywords

Publish with us

Policies and ethics