KnE Social Sciences

ISSN: 2518-668X

In-depth Analysis of Medical Dataset Mining: A Comparative Analysis on a Diabetes Dataset Before and After Preprocessing

Published date: Sep 19 2019

Journal Title: KnE Social Sciences

Issue title: Annual PwR Doctoral Symposium 2018–2019

Pages: 45–63

DOI: 10.18502/kss.v3i25.5190

Authors:

Latifa Nass, College of Business, Arts and Social Sciences, Brunel University London, UK

Stephen Swift, College of Engineering, Design and Physical Sciences, Brunel University London, UK

Ammar Al Dallal, College of Engineering, Ahlia University, Bahrain

Abstract:

Most healthcare organizations and medical research institutions store their patients' data digitally for future reference and for planning future treatments. These heterogeneous medical datasets are difficult to analyze because of their complexity and volume, and because missing values and noise make mining them a tedious task. Efficient classification of medical datasets has long been, and remains, a major data mining problem. Diagnosis, disease prediction, and the precision of results can be improved if relationships and patterns are extracted efficiently from these complex datasets. This paper analyses several major classification algorithms, namely C4.5 (J48), SMO, Naïve Bayes, KNN, and Random Forest, and compares their performance using WEKA. Performance evaluation is based on accuracy, sensitivity, specificity, and error rate. The medical datasets used in this study are the Heart-Statlog dataset, which holds data related to heart disease, and the Pima Diabetes dataset, which holds data related to diabetes. This study contributes to finding the most suitable algorithm for classifying medical data and also reveals the importance of preprocessing in improving classification performance. The performance of the machine learning algorithms is compared through graphical representation of the results.

Keywords: Data Mining, Health Care, Classification Algorithms, Accuracy, Sensitivity, Specificity, Error Rate
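
The four evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' WEKA workflow; it assumes a binary classifier summarized by a confusion matrix, and the names `ConfusionCounts` and `evaluate`, as well as the example counts, are hypothetical and are not taken from the paper's results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container, not from the paper)."""
    tp: int  # diseased cases predicted as diseased
    fn: int  # diseased cases predicted as healthy
    fp: int  # healthy cases predicted as diseased
    tn: int  # healthy cases predicted as healthy


def evaluate(c: ConfusionCounts) -> dict:
    """Return accuracy, sensitivity, specificity and error rate as fractions."""
    total = c.tp + c.fn + c.fp + c.tn
    positives = c.tp + c.fn
    negatives = c.tn + c.fp
    if total == 0 or positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative instance")
    return {
        "accuracy": (c.tp + c.tn) / total,      # share of correct predictions
        "sensitivity": c.tp / positives,        # true positive rate
        "specificity": c.tn / negatives,        # true negative rate
        "error_rate": (c.fp + c.fn) / total,    # equals 1 - accuracy
    }


if __name__ == "__main__":
    # Illustrative counts only; they do not come from the study.
    metrics = evaluate(ConfusionCounts(tp=40, fn=10, fp=5, tn=45))
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

Reporting sensitivity and specificity alongside accuracy matters for medical data because the classes are often imbalanced, so a classifier can reach high accuracy while still missing many diseased cases.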

