If you came to my article to learn more about EDA, at this point you probably have a clue what I’m talking about. If so, feel free to skip to the next topic; if you don’t know what EDA is yet, stick with me a little.
Exploratory Data Analysis refers to an approach of analyzing and summarizing data, often with visual methods. We want to use data as a tool to help us solve problems and make better decisions, but when we use it only to validate our own assumptions, we are going in the wrong direction. The idea is to understand the data without biased assumptions. We just want to look at it the way it is. Always remember:
This is a time-consuming process but if it’s done properly, it will save you some precious time and energy down the road and will help you make better choices.
Now that you know what EDA is and why it matters, we can move on to performing it. Here are some steps to guide you through.
- Understand the dataset — assess the quality of the dataset
- Distribution of the dataset — what does the data look like?
- Correlations — find patterns in the dataset
This first step will help us describe the basics of the data and get a summary of it. First, let’s import the dataset. For this example, I’m using a dataset that I have downloaded from Kaggle.
Load the dataset
import pandas as pd

dataset = pd.read_csv("country_vaccinations.csv")
Then I like to see what the dataset looks like. The head() function shows the first five rows.
dataset.head()
Count the number of rows
I’d like to see how big the dataset is. There are several ways of doing this; one is using the index attribute.
dataset.index
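If you just want the row count as a single number, there are a couple of equivalent options. A quick sketch, using a tiny stand-in DataFrame since the Kaggle file isn’t bundled here:

```python
import pandas as pd

# hypothetical stand-in for the vaccination dataset
df = pd.DataFrame({"country": ["Wales", "Israel", "Wales"],
                   "daily_vaccinations": [100, 200, 150]})

n_rows = len(df)          # number of rows
shape = df.shape          # (rows, columns) tuple
print(n_rows, shape)      # → 3 (3, 2)
```

Either works; df.shape has the advantage of also telling you how many columns you have.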
Remove any unnecessary columns
If there is any column/feature that you don’t find relevant, it’s possible to remove it. I will be removing the “iso_code”, “source_name” and “source_website” columns because they won’t help me with the analysis. Just make sure removing a column won’t interfere with your analysis; if you are not sure, keep it the way it is.
columns = ["iso_code", "source_name", "source_website"] #define the columns to be removed
dataset_clean = dataset.drop(columns, axis=1)
Check for null values
After that, let’s check the number of missing values in each feature.
dataset_clean.isnull().sum()
We notice here that there are some null values in the respective columns. We can use some approaches to deal with this:
- Drop missing values.
- Replace with mean, median, or mode values.
For now, to simplify things, I will drop the rows that contain null values.
dataset_clean = dataset_clean.dropna()
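If dropping rows would discard too much data, the second approach from the list above (imputation) is a small change. A hedged sketch on a toy column, standing in for one of this dataset’s numeric features:

```python
import pandas as pd

# toy column with one missing value
df = pd.DataFrame({"daily_vaccinations": [100.0, None, 300.0]})

# replace missing values with the column mean
# (swap .mean() for .median() or .mode()[0] as needed)
df["daily_vaccinations"] = df["daily_vaccinations"].fillna(
    df["daily_vaccinations"].mean()
)
print(df["daily_vaccinations"].tolist())  # → [100.0, 200.0, 300.0]
```

Which strategy is right depends on the feature; the median is usually safer when the column has outliers.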
Check the data type
Now let’s check the data types of each column.
dataset_clean.dtypes
We can see that we have objects and floats in the data frame. Now let’s get to the next step.
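One common fix at this step: date columns in CSV files usually load as plain object (string) columns. A hedged sketch of converting one with pd.to_datetime, on a toy frame mimicking this dataset’s “date” column:

```python
import pandas as pd

# toy frame with a date column loaded as strings, as read_csv typically does
df = pd.DataFrame({"date": ["2021-01-01", "2021-01-02"],
                   "daily_vaccinations": [100.0, 200.0]})
print(df["date"].dtype)  # → object

# parse the strings into real datetime values
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)  # → datetime64[ns]
```

Having real datetimes makes time-based filtering, resampling, and plotting much easier later on.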
For this part, I have two approaches. A super simple one that takes literally two lines of code and a more laborious one. I usually do both of them since the former is so easy and if I see that it’s not enough, I just do the latter.
Using Sweetviz
Sweetviz is a Python library that can do exploratory data analysis in very few lines of code. I will show you how to install it and how I use it.
Installation:
Install the Sweetviz library using pip:
pip install sweetviz
Generating the report:
Now let’s generate the report. For this part, I will set “daily_vaccinations” as the target feature. When you run this code, it will generate the report as an HTML file. If you prefer to generate it inside a notebook, simply change report.show_html() to report.show_notebook().
import sweetviz as sv

report = sv.analyze(dataset_clean, target_feat="daily_vaccinations")
report.show_html()
It will generate some basic visualizations such as a correlation matrix (for both categorical and numerical variables) and histograms. Sweetviz is a great way for beginners to do an exploratory data analysis.
For us, though, we want to explore the data further, look at distributions, and compare groups to get hints and insights for the analysis.
Histograms
Histograms are a great way of visualizing the distribution of the data. We can use them to compare groups, confirm results, and stratify the data. This is really useful if you want to run a hypothesis test: how your data is distributed determines which type of test you should run. If you want to know more about histograms, there is a great video from StatQuest on YouTube.
For this analysis, I want to compare the distribution of daily vaccinations between Wales and Israel. Firstly, I want to filter the data frames to contain data only from the countries that I want to analyze.
# filter the dataframes by country
wales = dataset_clean[dataset_clean["country"] == 'Wales']
israel = dataset_clean[dataset_clean["country"] == 'Israel']
Then, I will plot the histograms using the hist() function from pandas.
wales.hist(column="daily_vaccinations")
israel.hist(column="daily_vaccinations")
We observe here that the histogram for Wales is skewed to the right compared to Israel. We could assume that Wales is vaccinating more per day than Israel, but if we look carefully at the X-axis, the range of vaccinations per day for Wales is 5,000 to 28,000 compared to 60,000 to 190,000 for Israel. Therefore, Israel is vaccinating far more people per day than Wales.
Box Plots
Box plots are useful for observing outliers in the dataset and their variability. The figure below explains how a box plot works. There is also another great video from StatQuest that explains what a box plot is.
We will use the box plot to analyze how the data is distributed and if we have any outliers. Again, we will be using data from Wales and Israel.
wales.boxplot(column=["daily_vaccinations"])
israel.boxplot(column=["daily_vaccinations"])
The first thing we notice is the absence of outliers for both countries. The second is that the medians (represented by the green line inside each box) of the two countries are very different. For Wales, the median is around 21,000 and for Israel it is around 118,000, almost six times more.
The median — the middle number in a sorted list of numbers — is especially useful when we have a lot of outliers in the data. The mean, on the other hand, is very sensitive to outliers. As an example, if someone miscounted the number of vaccinations per day in Wales and entered 200,000 instead of 20,000, it would heavily affect the mean but not the median.
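That sensitivity is easy to demonstrate. A small sketch with made-up daily counts, including one mistyped outlier like the one described above:

```python
import pandas as pd

# hypothetical daily vaccination counts, with 200,000 mistyped for 20,000
daily = pd.Series([18_000, 19_000, 20_000, 21_000, 200_000])

print(daily.median())  # → 20000.0  (barely affected by the outlier)
print(daily.mean())    # → 55600.0  (dragged far upward by the outlier)
```

Without the typo, the mean would be close to 19,500; a single bad value nearly tripled it, while the median stayed put.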
When outliers are present, it’s necessary to evaluate whether they are wrong measurements or unique circumstances. Outliers are not always a bad thing: we can analyze them separately, and they can give us insights into our analysis.
Correlation is a great way to analyze if there is any interdependence between the variables.
It’s important to always remember, correlation does not imply causation.
With this in mind, let’s plot the correlation matrix. For this part, I will be using the corr() function and the seaborn library to plot the graph.
import matplotlib.pyplot as plt
import seaborn as sns

corr = dataset_clean.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.4)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True);
Before analyzing the correlation matrix, let me explain how to read it. The Pearson correlation coefficient has a value between -1 and 1.
- 0 indicates no linear relationship.
- 1 indicates a perfect positive linear correlation, which means that if one variable goes up, the other goes up as well.
- -1 indicates a perfect negative linear correlation, which means that if one variable goes up, the other goes down.
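The two extreme cases above are easy to reproduce. A minimal sketch with toy columns (not from the vaccination dataset): y rises exactly with x, while z falls exactly as x rises.

```python
import pandas as pd

# toy data: y is perfectly increasing in x, z is perfectly decreasing
df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2, 4, 6, 8, 10],
                   "z": [5, 4, 3, 2, 1]})

corr = df.corr()            # Pearson correlation by default
print(corr.loc["x", "y"])   # → 1.0
print(corr.loc["x", "z"])   # → -1.0
```

Real data almost never hits exactly 1 or -1; values in between express how closely the relationship approximates a straight line.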
The matrix makes sense when we analyze it. As an example, the correlation between daily vaccinations and total vaccinations is 1: the higher the number of people vaccinated daily, the higher the total number of vaccinations, which is exactly what we would expect.
For this case, the correlations are pretty obvious, but for other datasets, this might end up giving you interesting insights to further the analysis.
With this data, it’s possible to move the analysis into different directions:
- Hypothesis testing
- Gather more data
- Causal inference
We have noticed that Israel vaccinates far more people per day than Wales. The difference between the means of the two countries is large, so we can be fairly confident it is significant. If the difference were small, we could perform a hypothesis test to check whether it is significant or not.
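A hedged sketch of what such a test could look like, using Welch’s two-sample t-test from SciPy (assumed to be installed) on synthetic arrays standing in for the two countries’ daily vaccination counts:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic stand-ins for daily vaccination counts in each country
wales = rng.normal(loc=20_000, scale=3_000, size=50)
israel = rng.normal(loc=120_000, scale=20_000, size=50)

# Welch's t-test: compares means without assuming equal variances
t_stat, p_value = stats.ttest_ind(wales, israel, equal_var=False)
print(p_value < 0.05)  # → True here: the difference is significant
```

With real data you would first check the distributional assumptions (those histograms come in handy here) and pick the test accordingly, e.g. a Mann-Whitney U test for clearly non-normal data.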
Knowing that Israel vaccinates more, we could move the analysis toward gathering more data and seeking the causes of why they have been so successful in vaccinating against COVID-19.
As you have probably noticed, I’ve tried to keep this as simple as possible. This is how I usually do my EDA, but it is by no means what you need to follow. There is no one-size-fits-all EDA. Adjust it to your needs, but always remember to stay as unbiased as possible in your analysis.