A Practical Guide to Exploratory Data Analysis (EDA) in Python — How to Start Any Data Analysis. (2024)

If you came to this article already knowing something about EDA, feel free to skip to the next section. For everyone else, stick with me for a moment.

Exploratory Data Analysis is an approach to analyzing and summarizing data, often with visual methods. We want to use data as a tool to help us solve problems and make better decisions, but when we use it merely to validate our own assumptions, we are heading in the wrong direction. The idea is to understand the data without biased assumptions: we just want to look at it the way it is.

This is a time-consuming process, but done properly it will save you precious time and energy down the road and help you make better choices.

Now that you know what EDA is and why it matters, we can move on to performing it. Here are some steps that will guide you through:

  • Understand the dataset — assess the quality of the dataset
  • Distribution of the dataset — what does the data look like?
  • Correlations — find patterns in the dataset

This first step will help us describe the basics of the data and get a summary of it. First, let’s import the dataset. For this example, I’m using a dataset that I have downloaded from Kaggle.

Load the dataset

import pandas as pd
dataset = pd.read_csv("country_vaccinations.csv")
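As an aside, read_csv can do some cleanup at load time. A minimal, self-contained sketch using an in-memory CSV (the column names here are illustrative, not the full Kaggle schema):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the Kaggle file.
csv_text = """country,date,daily_vaccinations
Wales,2021-01-10,5000
Israel,2021-01-10,150000
"""

# parse_dates converts the date column to datetime64 on load,
# which saves a separate conversion step later.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes["date"])  # datetime64[ns]
```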

Then I like to see what the dataset looks like. The head() function shows the first five rows.

dataset.head()

Count the number of rows

I’d like to see how big the dataset is. There are a few ways of doing this; one is to look at the index.

dataset.index
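dataset.index shows the RangeIndex, but len() and .shape answer the size question more directly. A small sketch with a stand-in frame:

```python
import pandas as pd

# A small frame standing in for the vaccination dataset.
df = pd.DataFrame({"country": ["Wales", "Israel", "Wales"],
                   "daily_vaccinations": [5000.0, 150000.0, 7000.0]})

print(len(df))    # number of rows: 3
print(df.shape)   # (rows, columns): (3, 2)
print(df.index)   # RangeIndex(start=0, stop=3, step=1)
```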

Remove any unnecessary columns

If there is any column/feature that you don’t find relevant, it’s possible to remove it. I will be removing the “iso_code”, “source_name” and “source_website” columns because they won’t help me with the analysis. Just make sure that removing a column won’t interfere with your analysis; if you are not sure, keep it.

columns = ["iso_code", "source_name", "source_website"] #define the columns to be removed
dataset_clean = dataset.drop(columns, axis=1)
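An equivalent spelling, sketched on a toy frame, uses the columns= keyword instead of axis=1 (plus errors="ignore" in case a column might already be gone):

```python
import pandas as pd

# Toy frame with the same column names as the article's example.
df = pd.DataFrame({"iso_code": ["WLS"], "source_name": ["x"],
                   "source_website": ["y"], "country": ["Wales"]})

# columns= names the axis explicitly; errors="ignore" skips
# columns that are already absent instead of raising a KeyError.
clean = df.drop(columns=["iso_code", "source_name", "source_website"],
                errors="ignore")
print(list(clean.columns))  # ['country']
```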

Check for null values

After that, let’s check the number of missing values in each feature.

dataset_clean.isnull().sum()

We notice here that several columns contain null values. There are a few common ways to deal with this:

  • Drop missing values.
  • Replace with mean, median, or mode values.
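Both strategies can be sketched on a toy column (the values are made up for illustration, not taken from the real dataset):

```python
import pandas as pd

df = pd.DataFrame({"daily_vaccinations": [5000.0, None, 7000.0, None, 6000.0]})

# Option 1: drop any row containing a null.
dropped = df.dropna()

# Option 2: impute nulls with a summary statistic instead.
mean_filled = df["daily_vaccinations"].fillna(df["daily_vaccinations"].mean())
median_filled = df["daily_vaccinations"].fillna(df["daily_vaccinations"].median())

print(len(dropped))              # 3 rows survive
print(mean_filled.isna().sum())  # 0
```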

For now, to simplify things, I will drop the rows that contain null values.

dataset_clean = dataset_clean.dropna()

Check the data type

Now let’s check the data type of each column.

dataset_clean.dtypes

We can see that we have objects and floats in the data frame. Now let’s get to the next step.
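One dtype fix worth making early: date columns usually load as plain object strings. A sketch, assuming a "date" column like the one in the Kaggle file:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2021-01-10", "2021-01-11"],
                   "daily_vaccinations": [5000.0, 7000.0]})

print(df.dtypes["date"])  # object (plain strings)

# Convert the string column to a proper datetime dtype so that
# sorting, resampling, and date arithmetic work as expected.
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes["date"])  # datetime64[ns]
```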

For this part, I have two approaches. A super simple one that takes literally two lines of code and a more laborious one. I usually do both of them since the former is so easy and if I see that it’s not enough, I just do the latter.

Using Sweetviz

Sweetviz is a Python library that can perform exploratory data analysis in very few lines of code. I will show you how to install it and how I use it.

Installation:

Install the Sweetviz library using pip:

pip install sweetviz

Generating the report:

Now let’s generate the report. For this part, I will set “daily_vaccinations” as the target feature. When you run this code, it will generate the report as an HTML file. If you prefer to render it inside a notebook, simply change report.show_html() to report.show_notebook().

import sweetviz as sv
report = sv.analyze(dataset_clean, target_feat="daily_vaccinations")
report.show_html()

It will generate some basic visualizations such as a correlation matrix (for both categorical and numerical variables) and histograms. Sweetviz is a great way for beginners to do exploratory data analysis.

In our case, though, we want to explore the data further: look at distributions and compare groups to surface hints and insights for the analysis.

Histograms

Histograms are a great way of visualizing the distribution of the data. We can use them to compare and confirm results and to stratify the data. This is really useful if you want to run a hypothesis test: how your data is distributed changes the type of test you should run. If you want to know more about histograms, there is a great video from StatQuest on YouTube.

For this analysis, I want to compare the distribution of daily vaccinations between Wales and Israel. Firstly, I want to filter the data frames to contain data only from the countries that I want to analyze.

# filter the dataframes by country
wales = dataset_clean[dataset_clean["country"] == 'Wales']
israel = dataset_clean[dataset_clean["country"] == 'Israel']

Then I will plot the histograms using the hist() function from pandas.

wales.hist(column="daily_vaccinations")
israel.hist(column="daily_vaccinations")

We observe here that the histogram for Wales is skewed to the right compared to Israel’s. At first glance we might assume that Wales is vaccinating more people per day than Israel, but look carefully at the X-axis: the range of daily vaccinations for Wales is about 5,000 to 28,000, compared to 60,000 to 190,000 for Israel. Israel is therefore vaccinating far more people per day than Wales.
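One way to avoid misreading separately scaled axes is to put the ranges side by side numerically. A sketch with made-up values in the rough ranges described above:

```python
import pandas as pd

# Made-up daily-vaccination figures in the rough ranges the
# article describes; not the actual Kaggle data.
wales = pd.Series([5000, 12000, 21000, 26000, 28000])
israel = pd.Series([60000, 95000, 118000, 150000, 190000])

# Comparing min/max side by side makes the scale difference
# explicit, which two separately scaled histograms can hide.
summary = pd.DataFrame({"Wales": [wales.min(), wales.max()],
                        "Israel": [israel.min(), israel.max()]},
                       index=["min", "max"])
print(summary)
```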

Box Plots

Box plots are useful for observing outliers in the dataset and the variability of the data. There is also a great video from StatQuest that explains what a box plot is.

We will use the box plot to analyze how the data is distributed and if we have any outliers. Again, we will be using data from Wales and Israel.

wales.boxplot(column=["daily_vaccinations"])
israel.boxplot(column=["daily_vaccinations"])

The first thing we notice is the absence of outliers for both countries. The second is that the medians (the green line inside each box) are very different: for Wales the median is around 21,000, while for Israel it is around 118,000, almost six times higher.

The median, the middle number in a sorted list, is especially useful when the data contains many outliers. The mean, on the other hand, is very sensitive to outliers. For example, if someone miscounted the daily vaccinations in Wales and entered 200,000 instead of 20,000, it would greatly affect the mean but barely move the median.
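This claim is easy to verify with a few lines of stdlib Python (the numbers are made up for illustration):

```python
from statistics import mean, median

# Five plausible Wales-style daily counts, then the same data
# with one value mistyped as 200,000 instead of 20,000.
correct = [18000, 19000, 20000, 21000, 22000]
typo = [18000, 19000, 200000, 21000, 22000]

print(mean(correct), median(correct))  # 20000 20000
print(mean(typo), median(typo))        # 56000 21000
```

The single typo nearly triples the mean while shifting the median by only 1,000.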

When outliers are present, it’s necessary to evaluate whether they are measurement errors or genuine, unusual circumstances. Outliers are not always a bad thing: analyzed separately, they can give us useful insights.
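For reference, the usual box-plot outlier rule is the 1.5 × IQR fence, which can be checked by hand (made-up values; quantile interpolation follows numpy's default via method="inclusive"):

```python
from statistics import quantiles

# Points beyond 1.5 * IQR from the quartiles are the ones a
# box plot draws as individual outlier markers.
data = [18000, 19000, 20000, 21000, 22000, 200000]
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print(outliers)  # [200000]
```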

Correlation is a great way to analyze whether there is any interdependence between variables.

It’s important to always remember, correlation does not imply causation.

With this in mind, let’s plot the correlation matrix. For this part, I will be using the corr() function and the seaborn library to plot the graph.

import matplotlib.pyplot as plt
import seaborn as sns

# numeric_only=True restricts corr() to numeric columns (required in pandas >= 2.0)
corr = dataset_clean.corr(numeric_only=True)

plt.figure(figsize=(12, 10))
# keep only correlations stronger than the chosen thresholds
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.4)],
            cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
            annot=True, annot_kws={"size": 8}, square=True)

Before analyzing the correlation matrix, let me explain how to read it. The Pearson correlation coefficient has a value between -1 and 1.

  • 0 indicates no linear relationship.
  • 1 indicates a perfect positive linear correlation: when one variable goes up, the other goes up as well.
  • -1 indicates a perfect negative linear correlation: when one variable goes up, the other goes down.
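These extremes are easy to confirm on toy series:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

df = pd.DataFrame({
    "x": x,
    "double": 2 * x,   # perfect positive linear relation
    "negated": -x,     # perfect negative linear relation
})

corr = df.corr()
print(corr.loc["x", "double"])   # +1: they rise together
print(corr.loc["x", "negated"])  # -1: one rises, the other falls
```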

The matrix lines up with intuition. For example, the correlation between daily vaccinations and total vaccinations is 1: the more people vaccinated per day, the higher the running total, which is exactly what we’d expect.

For this case, the correlations are pretty obvious, but for other datasets, this might end up giving you interesting insights to further the analysis.

With this data, it’s possible to take the analysis in different directions:

  • Hypothesis testing
  • Gather more data
  • Causal inference

We have noticed that Israel vaccinates far more people per day than Wales. The difference between the means of the two countries is large, so we can be fairly confident that it is significant. If the difference were small, we could perform a hypothesis test to check whether it is significant or not.
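If we did want a formal check, a common choice is a two-sample t-test (e.g. scipy.stats.ttest_ind with equal_var=False, which also returns a p-value). A stdlib-only sketch of Welch's t statistic, with made-up samples:

```python
from math import sqrt
from statistics import mean, variance

# Made-up daily-vaccination samples for the two countries.
wales = [18000, 19000, 20000, 21000, 22000]
israel = [110000, 115000, 118000, 121000, 126000]

def welch_t(a, b):
    # Difference in means scaled by the combined standard error;
    # |t| far from 0 suggests the gap is unlikely to be noise.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

t = welch_t(wales, israel)
print(round(t, 1))  # a large negative value: Wales' mean is far below Israel's
```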

Knowing that Israel vaccinates more, we could gather more data and look for the causes of its success in vaccinating against COVID-19.

As you have probably noticed, I’ve tried to keep things as simple as possible. This is how I usually do my EDA, but it is by no means what you must follow. There is no one-size-fits-all EDA. Adjust it to your needs, but always remember to analyze the data with as little bias as possible.


