KnE Social Sciences

ISSN: 2518-668X

A New Similarity Measure for Document Classification and Text Mining

Published date: Jan 12 2020

Journal Title: KnE Social Sciences

Issue title: Economies of the Balkan and Eastern European Countries

Pages: 353–366

DOI: 10.18502/kss.v4i1.5999

Authors:

Mete Eminağaoğlu, mete.eminagaoglu@deu.edu.tr, Dept. of Computer Science, Dokuz Eylül University, Tınaztepe, Buca, İzmir, Turkey

Yılmaz Gökşen, Dept. of Management Information Systems, Dokuz Eylül University, Buca, İzmir, Turkey

Abstract:

Accurate, efficient, and fast processing of textual data and classification of electronic documents have become key factors in knowledge management and related businesses. Text mining, information retrieval, and document classification systems have a strong positive impact on digital libraries, electronic content management, e-marketing, electronic archives, customer relationship management, decision support systems, and the detection of copyright infringement and plagiarism, all of which directly affect the economy, businesses, and organizations. In this study, we propose a new similarity measure that can be used with the k-nearest neighbors (k-NN) and Rocchio algorithms, two well-known algorithms for document classification, information retrieval, and other text mining tasks. We tested the proposed measure on several structured textual data sets and compared the results with standard distance metrics and similarity measures such as cosine similarity, Euclidean distance, and the Pearson correlation coefficient. The results are promising and suggest that the proposed similarity measure could be used as an alternative in suitable algorithms, methods, and models for text mining, document classification, and related knowledge management systems.

Keywords: text mining, document classification, similarity measures, k-NN, Rocchio algorithm
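
For orientation, the baseline measures named in the abstract (cosine similarity, Euclidean distance, and the Pearson correlation coefficient) can be plugged into a k-NN classifier as interchangeable scoring functions. The following is a minimal sketch under stated assumptions, not the authors' implementation: the proposed similarity measure is not defined on this page and is therefore not shown, and the function names, the toy term-count vectors, and the labels are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two term vectors (smaller means closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient (0.0 if either vector is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def knn_classify(query, train_vectors, train_labels, k, similarity):
    """Majority vote among the k training documents most similar to the query.

    `similarity` must return larger values for closer documents, so a
    distance has to be negated before it is passed in.
    """
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: similarity(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical toy data: each row is a term-count vector for one document.
train_vectors = [[3, 0, 1, 0], [2, 1, 0, 0], [0, 2, 0, 3], [0, 1, 1, 2]]
train_labels = ["sports", "sports", "finance", "finance"]
query = [1, 0, 0, 0]

measures = {
    "cosine": cosine_similarity,
    "euclidean": lambda q, d: -euclidean_distance(q, d),  # negate the distance
    "pearson": pearson_correlation,
}
predictions = {
    name: knn_classify(query, train_vectors, train_labels, 3, measure)
    for name, measure in measures.items()
}
print(predictions)
```

On this toy data all three measures assign the query to "sports". Passing the measure as a function argument keeps the classifier unchanged when a different measure is swapped in, which is the kind of comparison the abstract describes.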
