Data Mining, Weka Decision Trees

Zekeriya Duran

Sivas Cumhuriyet University

https://orcid.org/0000-0002-9327-8567

İsmail Akargöl

Sivas Cumhuriyet University

https://orcid.org/0000-0002-0721-7064

Tuğba Doğan

Sivas Cumhuriyet University

https://orcid.org/0000-0002-2628-4238

DOI: https://doi.org/10.56038/oprd.v3i1.376

Keywords: Weka, decision trees, classification


Abstract

Computer technologies are advancing rapidly. Thanks to these advances, large and complex raw data sets can be transformed into useful information using a variety of analysis techniques. The algorithms developed along the way offer solutions to scientists and practitioners in many fields, notably engineering, mathematics, medicine, industry, finance and economics, marketing, education, multimedia and statistics, and make it easier to reach the desired goals. By correctly managing and analyzing the data in large and complex raw data sets, accurate predictions can be made for similar problems in the future. Data sets are analyzed and evaluated using different methods, and how the data are classified during analysis and evaluation can significantly affect decisions about the work to be done. Data can be classified using statistical methods or data mining methods. Decision trees, which can classify both numerical and alphanumeric data, generally offer decision makers a considerable advantage over other classification techniques because they are easy to interpret and understand. For these reasons, this study discusses decision trees, one of the most widely used classification techniques in data mining.

