In my high school and university projects, we usually relied on accuracy to present the reliability of a model: turning it into a percentage makes it easy to interpret and compare. However, when I looked at work produced by data scientists, it often included other analyses such as a classification report and a confusion matrix. This article discusses why these are useful in data analysis, drawing on my own recent project.
Stroke prediction dataset
The aim of this project was to predict whether a person is likely to have a stroke based on factors such as gender, age, various diseases, and smoking status. I created data visualisations to aid understanding of the raw data, and conducted some exploratory data analysis to identify datatypes and missing values. To build a machine learning model, I encoded the categorical data and split the dataset into training and validation sets. Finally, I trained a random forest classifier to generate my predictions.
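The workflow above can be sketched roughly as follows. The column names and the tiny synthetic table here are placeholders, not the actual stroke dataset; the sketch assumes `pandas.get_dummies` for encoding and scikit-learn for splitting and modelling.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the stroke dataset (column names are assumptions)
df = pd.DataFrame({
    "gender": ["Male", "Female"] * 50,
    "age": range(100),
    "smoking_status": ["smokes", "never smoked"] * 50,
    "stroke": [0] * 95 + [1] * 5,   # heavily imbalanced target
})

# Encode the categorical columns, then split into training and validation sets
X = pd.get_dummies(df.drop(columns="stroke"))
y = df["stroke"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_val, model.predict(X_val))
```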
My model had an accuracy of 0.95, and my first impression was that it was performing well, as the accuracy was quite high. However, when I generated the classification report, the real problem with the model appeared.
The precision for non-stroke cases was 95%, while the precision for stroke cases was 0%. The recall column showed that the model recalled all the negative cases but failed to predict any of the positive cases. (The precision for negative cases was not 100% because the model also predicted every stroke-positive case as negative.)
For more detail, a confusion matrix was also generated: true negatives = 929, false negatives = 53, and both true positives and false positives = 0.
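These numbers can be reproduced directly with scikit-learn. The sketch below rebuilds the validation labels from the counts above (929 negatives, 53 positives) against a model that only ever predicts the negative class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 929 true negatives and 53 true positives, as in the project
y_true = np.array([0] * 929 + [1] * 53)
# A model that always predicts "no stroke"
y_pred = np.zeros_like(y_true)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[929   0]
#  [ 53   0]]

print(round(accuracy_score(y_true, y_pred), 2))   # 0.95

# zero_division=0 avoids a warning for the class with no predictions
print(classification_report(y_true, y_pred, zero_division=0))
```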
So why does that happen? And how can we deal with that?
Imbalanced datasets
This happens because the distribution of stroke and non-stroke cases is far from equal. The model only captured the pattern of the non-stroke cases and could not recognise stroke cases, leading to significant bias when generating predictions.
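A quick way to see the problem: with this class distribution, a trivial model that always predicts the majority class already scores the same 0.95 accuracy, so the number on its own tells us very little. A short sketch using the counts from the confusion matrix:

```python
# Class counts from the validation set (929 non-stroke, 53 stroke)
n_negative, n_positive = 929, 53
total = n_negative + n_positive

# Accuracy of a model that always predicts "no stroke"
majority_baseline = n_negative / total
print(round(majority_baseline, 2))   # 0.95
```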
Common examples of imbalanced datasets
Imbalanced datasets are actually very common in machine learning. Healthcare datasets use patient data to predict disease, but the people who actually have the disease are usually in the minority. When generating predictions for diseases, we should never rely on overall accuracy alone; it is more important to look at how many of the positive cases are successfully identified. Otherwise, the main purpose of the model is not achieved, and incorrect predictions may lead to serious outcomes.
Another example is fraud detection. Distributions in fraud detection datasets are even more extreme than in healthcare datasets, as millions or billions of transactions can happen nationally per day but only a few are fraudulent. Techniques such as oversampling can be used to balance the fraud and non-fraud data when training the model.
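One simple form of oversampling is duplicating minority-class rows until the classes are the same size. The sketch below uses scikit-learn's `resample` on made-up labels (the feature array and the 990/10 split are illustrative only; libraries such as imbalanced-learn offer more sophisticated methods like SMOTE):

```python
import numpy as np
from sklearn.utils import resample

# Made-up imbalanced training data: 990 legitimate, 10 fraudulent
X = np.arange(1000).reshape(-1, 1)          # placeholder features
y = np.array([0] * 990 + [1] * 10)

# Sample the minority class with replacement until it matches the majority
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=990, random_state=42)

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
# Both classes now have 990 examples
```

Note that oversampling should be applied only to the training set, never to the validation set, otherwise duplicated rows leak into the evaluation.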