In my high school and university projects, we usually relied on accuracy to present the reliability of a model: turning it into a percentage makes it easy to interpret and compare. However, when I looked at work produced by data scientists, it often included other analyses such as a classification report and a confusion matrix. This article discusses why these are useful in data analysis, drawing on my own recent project.
Stroke prediction dataset
The aim of this project was to predict whether a person is likely to have a stroke based on factors such as gender, age, various diseases, and smoking status. I created data visualisations to aid understanding of the raw data, and conducted some exploratory data analysis to identify datatypes and missing values. To build a machine learning model, I encoded the categorical data and split the dataset into training and validation sets. Finally, I trained a random forest classifier to generate my predictions.
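The workflow above can be sketched roughly as follows. The column names and the tiny synthetic table here are placeholders, not the actual stroke dataset; the sketch assumes `pandas.get_dummies` for encoding and scikit-learn for splitting and modelling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the stroke dataset (column names are assumptions)
df = pd.DataFrame({
    "gender": ["Male", "Female"] * 50,
    "age": range(100),
    "smoking_status": ["smokes", "never smoked"] * 50,
    "stroke": [0] * 95 + [1] * 5,   # heavily imbalanced target
})

# Encode the categorical columns, then split into training and validation sets
X = pd.get_dummies(df.drop(columns="stroke"))
y = df["stroke"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_val, model.predict(X_val))
```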
My model had an accuracy of 0.95, and my first impression was that it was performing well, as the accuracy was quite high. However, when I generated the classification report, the real problem with the model appeared.
The precision for non-stroke cases was 95%, while the precision for stroke cases was 0%. The recall column showed that the model recalled all the negative cases but failed to predict any of the positive cases. (The precision for negative cases was not 100% because the model also predicted every stroke-positive case as negative.)
For more detail, a confusion matrix was also generated: true negatives = 929, false negatives = 53, and both true positives and false positives = 0.
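These numbers can be reproduced directly with scikit-learn. The sketch below rebuilds the validation labels from the counts above (929 negatives, 53 positives) against a model that only ever predicts the negative class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 929 true negatives and 53 true positives, as in the project
y_true = np.array([0] * 929 + [1] * 53)
# A model that always predicts "no stroke"
y_pred = np.zeros_like(y_true)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[929   0]
#  [ 53   0]]

print(round(accuracy_score(y_true, y_pred), 2))   # 0.95

# zero_division=0 avoids a warning for the class with no predictions
print(classification_report(y_true, y_pred, zero_division=0))
```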
So why does that happen? And how can we deal with that?
Imbalanced datasets
This happens because the distribution of stroke and non-stroke cases is far from equal. The model only captured the pattern of the non-stroke cases and could not recognise stroke cases, leading to significant bias when generating predictions.
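A quick way to see the problem: with this class distribution, a trivial model that always predicts the majority class already scores the same 0.95 accuracy, so the number on its own tells us very little. A short sketch using the counts from the confusion matrix:

```python
# Class counts from the validation set (929 non-stroke, 53 stroke)
n_negative, n_positive = 929, 53
total = n_negative + n_positive

# Accuracy of a model that always predicts "no stroke"
majority_baseline = n_negative / total
print(round(majority_baseline, 2))   # 0.95
```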
Common examples of imbalanced datasets
Imbalanced datasets are actually very common in machine learning. Healthcare datasets use patient data to predict disease, but the people who actually have the disease are usually in the minority. When generating predictions for diseases, we should never rely on overall accuracy alone; it is more important to look at how many of the positive cases are successfully identified. Otherwise, the main purpose of the model is not achieved, and incorrect predictions may lead to serious outcomes.
Another example is fraud detection. Distributions in fraud detection datasets are even more extreme than in healthcare datasets, as millions or billions of transactions can happen nationally per day but only a few are fraudulent. Techniques such as oversampling can be used to balance the fraud and non-fraud data when training the model.
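One simple form of oversampling is duplicating minority-class rows until the classes are the same size. The sketch below uses scikit-learn's `resample` on made-up labels (the feature array and the 990/10 split are illustrative only; libraries such as imbalanced-learn offer more sophisticated methods like SMOTE):

```python
import numpy as np
from sklearn.utils import resample

# Made-up imbalanced training data: 990 legitimate, 10 fraudulent
X = np.arange(1000).reshape(-1, 1)          # placeholder features
y = np.array([0] * 990 + [1] * 10)

# Sample the minority class with replacement until it matches the majority
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=990, random_state=42)

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
# Both classes now have 990 examples
```

Note that oversampling should be applied only to the training set, never to the validation set, otherwise duplicated rows leak into the evaluation.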