Natural Language Processing Based Poet Recognition with Supervised Learning on Turkish Poetry Dataset

Serkan Korkmaz

Harran University

https://orcid.org/0000-0003-2523-8819

Fehim Köylü

Erciyes University

https://orcid.org/0000-0001-7991-5841

DOI: https://doi.org/10.56038/oprd.v4i1.470

Keywords: Natural language processing, text mining, supervised learning, support vector machine, Bayesian classifiers, decision trees, random forest


Abstract

Studies based on natural language processing have become popular in recent years, and the number of studies on Turkish is increasing. The author classification problem consists of determining whether an anonymous text belongs to one of a set of popular authors. The problem is motivated by the idea that each author's work reflects basic features of that author's intellectual vocabulary, so it should be possible to distinguish between authors. In this study, a dataset was built from 50 poems by 5 different poets from Turkish literature. Experiments were performed on the dataset using 9 different classifiers. This is a preliminary study intended to serve as a basis for future work.
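As a rough illustration of the kind of vocabulary-based supervised classification the abstract describes, the sketch below trains a small multinomial naive Bayes classifier (one member of the Bayesian classifier family named in the keywords) on word counts. It is not the authors' implementation: the whitespace tokenizer, the function names `tokenize`, `train`, and `predict`, the labels `poet_a` and `poet_b`, and the toy English texts are all assumptions made for illustration, and none of the data comes from the study's dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split. Real Turkish text would likely
    # need locale-aware casing and morphological analysis instead.
    return text.lower().split()


def train(samples):
    """Build per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of training texts
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log posterior under naive Bayes."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokenize(text):
            if token in model["vocab"]:  # ignore words never seen in training
                # Laplace (add-one) smoothed log likelihood
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this sketch only.
TOY_SAMPLES = [
    ("sea wind sea salt", "poet_a"),
    ("wind over the sea", "poet_a"),
    ("city street night lamp", "poet_b"),
    ("night in the city", "poet_b"),
]

model = train(TOY_SAMPLES)
print(predict(model, "salt wind"))
print(predict(model, "city night"))
```

In an actual experiment the toy samples would be replaced by labeled poems, and accuracy would be estimated on held-out texts rather than on hand-picked phrases.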

