Using Authorship Embeddings to Understand Writing Style in Social Media

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2023)

Abstract

With the escalation of misinformation and malicious behavior issues on social media platforms, traditional detection-based measures often fail to address the problem in time. The use of multiple accounts or the continuous creation of new accounts makes it difficult to re-detect the presence of a user who, for example, has disseminated false information. In this paper, we present a novel approach to understanding and characterizing authorship in social media using a model called \(PART_{SCL}\), an improvement of the previous PART model. \(PART_{SCL}\) generates “authorship embeddings”, numerical representations of an author’s writing style, allowing for more accurate and earlier detection of malicious behavior. Our main contributions include the \(PART_{SCL}\) model itself, a new pre-training approach for authorship attribution, and the application of our model on different datasets. These advances help bridge the gap between popular Natural Language Processing techniques such as Transformers and feature engineering, providing a robust tool for the ongoing fight against online misbehavior and misinformation.
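As a rough illustration of how authorship embeddings could support re-detecting a user across accounts, the sketch below averages per-post vectors for each account and compares the results with cosine similarity. It is not the authors' method: the helper names, the averaging step, the choice of cosine similarity, and the toy 3-dimensional vectors are all assumptions made for illustration, and the vectors are not outputs of \(PART_{SCL}\).

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_author_score(known_posts, new_posts):
    """Compare the mean embeddings of two sets of posts (hypothetical helper)."""
    return cosine_similarity(mean_vector(known_posts), mean_vector(new_posts))


# Toy vectors standing in for per-post embeddings (hypothetical values,
# not produced by any real model).
known_account = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
new_account = [[0.85, 0.15, 0.05]]
unrelated_account = [[0.0, 0.1, 0.9]]

print(same_author_score(known_account, new_account))
print(same_author_score(known_account, unrelated_account))
```

A higher score suggests a closer writing style under these assumptions. Any decision threshold would have to be calibrated on labelled data, which this sketch does not attempt.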

Notes

  1. RoBERTa large Hugging Face checkpoint: https://huggingface.co/roberta-large.

Acknowledgments

This work is part of the project PCI2022-134990-2 (MARTINI) of the CHISTERA IV Cofund 2021 program, funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR”; by the Spanish Ministry of Science and Innovation under the FightDIS (PID2020-117263GB-100) grant; by MCIN/AEI/10.13039/501100011033/ and European Union NextGenerationEU/PRTR under the XAI-Disinfodemics (PLEC2021-007681) grant; by the European Commission under IBERIFIER - Iberian Digital Media Research and Fact-Checking Hub (2020-EU-IA-0252); and by the “Convenio Plurianual with the Universidad Politécnica de Madrid in the actuation line of Programa de Excelencia para el Profesorado Universitario”. This publication is also part of the I+D+i project PLEC2021-007681, financed by MCIN/AEI/10.13039/501100011033/ and the European Union NextGenerationEU/PRTR.

Author information

Corresponding author

Correspondence to Alejandro Martín.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Huertas-Tato, J., Martín, A., Camacho, D. (2023). Using Authorship Embeddings to Understand Writing Style in Social Media. In: Arampatzis, A., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2023. Lecture Notes in Computer Science, vol 14163. Springer, Cham. https://doi.org/10.1007/978-3-031-42448-9_6

  • DOI: https://doi.org/10.1007/978-3-031-42448-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42447-2

  • Online ISBN: 978-3-031-42448-9

  • eBook Packages: Computer Science, Computer Science (R0)
