Investigating the Efficiency of Machine Learning Methods in Authorship Detection for Low-Resourced Languages: The Case of Kurdish Authors

Date of Publication : 2023-06-21
Article Type : Research Article

Saia Hasan ¹* and Hossein Hassani ¹

Affiliation

¹ Computer Science and Engineering, University of Kurdistan Hewlêr, Kurdistan Region, Iraq
* Corresponding Author

ORCID :

Saia Hasan: https://orcid.org/0000-0002-7864-3482, Hossein Hassani: https://orcid.org/0000-0002-8899-4016

DOI :

https://doi.org/10.23918/eajse.v9i2p14

Article History

Received: 2023-04-13

Revised: 2023-06-06

Accepted: 2023-06-07

Abstract

Textual data continues to multiply over time, and alongside this exponential growth the volume of anonymous material has also increased. Authorship detection has significant potential in numerous applications of authorship analysis, such as history and literary science, forensic examination, and plagiarism detection. For this study, we manually collected 2798 documents by 150 authors to investigate how effectively existing machine learning algorithms can identify the Kurdish authors of unattributed texts. The approach uses TF-IDF to weight each token and extracts the frequencies of token n-grams, from 1-grams to 5-grams, as features that capture a pattern in each author's text. We train SVM, CNB, MNB, and K-NN classifiers on a collection of documents of known authorship, on the assumption that the most important tokens of an unknown document resemble those of known documents by the same author. Each trained classifier is then given an unknown document and assesses how closely it resembles the known documents. SVM achieved an accuracy of 80% with both the one-vs-one (O-V-O) and one-vs-rest (O-V-R) approaches on token 1-grams, along with promising precision, recall, and F1-score results. To our knowledge, this is the first study to investigate authorship detection for the Kurdish language.

Keywords :

Authorship Detection; NLP; Authorship Analysis; KLPT; ML; TF-IDF
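
The feature pipeline summarized in the abstract (token n-grams weighted by TF-IDF, then a classifier that compares an unknown document with known ones) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes whitespace tokenization, a smoothed IDF formula, cosine similarity, and a plain k-nearest-neighbour vote; the function names and the toy English sentences are hypothetical placeholders, and the SVM, CNB, and MNB classifiers reported in the study are not reproduced here.

```python
import math
from collections import Counter


def token_ngrams(text, n_min=1, n_max=1):
    """Return whitespace-token n-grams for every n in [n_min, n_max]."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def fit_idf(docs_grams):
    """Smoothed IDF (an assumption): ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs_grams)
    df = Counter()
    for grams in docs_grams:
        df.update(set(grams))  # count each n-gram once per document
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(grams, idf):
    """L2-normalised sparse TF-IDF vector; n-grams unseen in training are dropped."""
    tf = Counter(g for g in grams if g in idf)
    vec = {g: count * idf[g] for g, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def predict_author(unknown_text, train_texts, train_authors,
                   k=1, n_min=1, n_max=1):
    """Majority vote among the k training documents most similar to the query."""
    train_grams = [token_ngrams(t, n_min, n_max) for t in train_texts]
    idf = fit_idf(train_grams)
    train_vecs = [tfidf_vector(g, idf) for g in train_grams]
    query = tfidf_vector(token_ngrams(unknown_text, n_min, n_max), idf)
    ranked = sorted(zip(train_vecs, train_authors),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(author for _, author in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy placeholder data (not from the study's corpus).
train_texts = [
    "the river runs cold in spring",
    "cold river water in the valley",
    "markets open early on friday",
    "the friday markets sell fresh bread",
]
train_authors = ["author_a", "author_a", "author_b", "author_b"]

print(predict_author("fresh bread at the markets", train_texts, train_authors))
```

Setting `n_max=5` would extend the features to the 1-gram to 5-gram range mentioned in the abstract; the sketch defaults to 1-grams, the setting for which the abstract reports its best SVM accuracy.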

How to cite

BibTeX

@article{hasan2023,
 author = {Hasan, Saia and Hassani, Hossein},
 title = {Investigating the Efficiency of Machine Learning Methods in Authorship Detection for Low-Resourced Languages: The Case of Kurdish Authors},
 journal = {Eurasian J. Sci. Eng},
 volume = {9},
 number = {2},
 pages = {178--194},
 year = {2023},
 doi = {10.23918/eajse.v9i2p14}
}

APA

Hasan, S., & Hassani, H. (2023). Investigating the Efficiency of Machine Learning Methods in Authorship Detection for Low-Resourced Languages: The Case of Kurdish Authors. Eurasian J. Sci. Eng, 9(2), 178-194.

MLA

Hasan, Saia, and Hossein Hassani. "Investigating the Efficiency of Machine Learning Methods in Authorship Detection for Low-Resourced Languages: The Case of Kurdish Authors." Eurasian J. Sci. Eng, vol. 9, no. 2, 2023, pp. 178-194.

HARVARD

Hasan, S. and Hassani, H., (2023) "Investigating the Efficiency of Machine Learning Methods in Authorship Detection for Low-Resourced Languages: The Case of Kurdish Authors", Eurasian J. Sci. Eng, 9(2), pp.178-194.

VANCOUVER

Hasan S, Hassani H. Investigating the Efficiency of Machine Learning Methods in Authorship Detection for Low-Resourced Languages: The Case of Kurdish Authors. Eurasian J. Sci. Eng. 2023; 9(2):178-194.
