Exploratory Data Analysis Python and Pandas with Examples (2024)

This article is about Exploratory Data Analysis(EDA) in Pandas and Python. The article will explain step by step how to do Exploratory Data Analysis plus examples.

EDA is an important step in Data Science. The goal of EDA is to identify errors, insights, relations, outliers and more.

The image below illustrate the data science workflow and where EDA is located:

Exploratory Data Analysis Python and Pandas with Examples (1)

Source: Exploratory Data Analysis - wikipedia

Background Story

Imagine that you are expecting royal guests for dinner. You are asked to research a special menu from a cooking book with thousands of recipes. As they are very pretentious you need to avoid some ingredients or find exact quantities for others. Dinner and launch menus are needed.

Unfortunately some recipes are wrong and others are incomplete.

How to start? You may start with the main ingredients. Another option is by selecting the meal type and course. There should be a special menu for Vegetarians.

The story above reminds of Exploratory Data Analysis(EDA). All the steps and questions are part of understanding unknown territory. Let's try to solve this with Data Science.

Exploratory Data Analysis Python and Pandas with Examples (2)

What is Exploratory Data Analysis?

Exploratory data analysis is like detective work: searching for insights that identify problems and hidden patterns.

Start with one variable at a time, then explore two variables, and so on.

Definition
Exploratory Data Analysis(EDA) - analyze and investigate datasets and summarize their main characteristics and apply visualization methods.

Tools for EDA:

Most popular Tools:

  • Pandas/Python
  • R

Techniques for EDA:

Visual techniques

  • Box plot
  • Histogram
  • Heat map
  • Scatter plot

Dimensionality reduction

  • Principal component analysis (PCA)

Exploratory Data Analysis - Example Project

Let's do an example on Exploratory Data Analysis using Food Recipes.

We will do step by step analysis on this data set and answer on questions like:

  • What data do we have?
  • What is the dimension of this data?
  • Are there any dependent variables?
  • What are the data types?
  • Missing data?
  • Duplicate data?
  • Correlations?

You can find the notebook on and Google Colab:

  • GitHub
  • Google Colab - you can play with this notebook without installing Python

Step 1: Load Data and Initial Analysis

First we will import the needed libraries and then load the dataset into Pandas:

import pandas as pdimport seaborn as snsimport matplotlib.pyplot as pltdf = pd.read_csv('https://raw.githubusercontent.com/softhints/Pandas-Exercises-Projects/main/data/food_recipes.csv', low_memory=False)

Step 2: Initial Analysis of Pandas DataFrame

We will check the data by using the following methods:

  • df - returns first and last 5 records; returns number of rows and columns
  • head(n) - returns first n rows
  • tail(n) - returns last n rows
  • sample(n) - sample random n rows

The first 2 rows transposed looks like:

01
recipe_titleRoasted Peppers And Mushroom Tortilla Pizza RecipeThakkali Gotsu Recipe | Thakkali Curry | Spicy & Tangy Tomato Gravy
urlhttps://www.archanaskitchen.com/roasted-peppers-and-mushroom-tortilla-pizza-recipehttps://www.archanaskitchen.com/tomato-gotsu-recipe-spicy-tangy-tomato-gravy
record_healthgoodgood
vote_count4343423
rating4.9585254.932223
descriptionis a quicker version pizza ....:also known as the is a ...
cuisineMexicanSouth Indian Recipes
courseDinnerLunch
dietVegetarianVegetarian
prep_time15 M10 M
cook_time15 M20 M
ingredientsTortillas|Extra Virgin Olive Oil|Garlic|Mozzarella cheese|Red Yellow or Green Bell Pepper (Capsicum)|Onions|Kalmatta olives|Button mushroomsSesame (Gingelly) Oil|Mustard seeds (Rai/ Kadugu)|Curry leaves|Garlic|Pearl onions (Sambar Onions)|Tomatoes|Tamarind|Turmeric powder (Haldi)|Salt|Jaggery
instructionsTo begin making the Roasted ...To begin making Tomato Gotsu Recipe/ ....
authorDivya ShivaramanArchana Doshi
tagsParty Food Recipes|Tea Party Recipes|Mushroom Recipes|Fusion Recipes|Tortilla Recipe|Bell Peppers RecipesVegetarian Recipes|Tomato Recipes|South Indian Recipes|Breakfast Recipe Ideas
categoryPizza RecipesIndian Curry Recipes

We are lucky with:

df.sample(2)

as we can see missing values and non latin symbols:

recipe_titleurlrecord_healthvote_countratingdescriptioncuisinecoursedietprep_timecook_timeingredientsinstructionsauthortagscategory
2400Coconut Chia Pudding Recipehttps://www.archanaskitchen.com/coconut-chia-pudding-recipegood1624.975309Coconut Chia Pudding makes a great snack or even ...NaNNaNNaNNaNNaNChia Seeds|Tender coconut water|Coconut milk|Orange|Blueberries|Pistachios|Honey|Mint Leaves (Pudina)To begin making the Coconut Chia Pudding recipe...Gauravi VinayHealthy Recipes|Breakfast Recipe Ideas|Pudding Recipes|Coconut RecipesDessert Recipes
1447पंजाबी स्टाइल बूंदी कढ़ी रेसिपी - Punjabi Style Boondi Kadhi Recipehttps://www.archanaskitchen.com/punjabi-style-boondi-kadhi-recipe-in-hindigood6854.954745एक प्रसिद्ध कढ़ी है जिसे राजस्थान और पंजाब में बनाया जाता है. इसमें मेथी के दाने भी डाले जाते है ...PunjabiSide DishVegetarian10 M30 Mदही|बेसन|हरी मिर्च|अदरक लहसुन का पेस्ट|लाल मिर्च पाउडर|हल्दी पाउडर|नमक|घी|लॉन्ग|पूरी काली मिर्च|जीरा|मेथी के दाने|हींग|बूंदीपंजाबी स्टाइल बूंदी कढ़ी रेसिपी बनाने के लिए सबसे पहले एक बाउल में ...Archana's KitchenSide Dish Recipes|Indian Lunch Recipes|Dahi Recipes (Curd/Yogurt)|Indian Dinner Recipes|Winter RecipesKadhi Recipes

The initial findings for this dataset are:

Insights

  • dataset has header column
  • separator is ';'
  • NaN represents missing values
  • 2 columns with nested data - ingredients and tags
    • if we take closer look we can find values divided by pipes - "|"
    • for further analysis - data need to be normalized
  • there are non latin characters in this dataset

Let's continue with the analysis and not rely on luck or first impressions.

Step 3: Dimension and data types of DataFrame

Next we will use several Pandas methods in order to find information for the columns, dimensions, nested data and more:

df.shape

will return the dimension of this dataset:

(8009, 16)

We can get columns by:

df.columns
Index(['recipe_title', 'url', 'record_health', 'vote_count', 'rating', 'description', 'cuisine', 'course', 'diet', 'prep_time', 'cook_time', 'ingredients', 'instructions', 'author', 'tags', 'category'], dtype='object')

To get information about the data types we can use:

df.info()df.dtypes

which results into:

<class 'pandas.core.frame.DataFrame'>RangeIndex: 8009 entries, 0 to 8008Data columns (total 16 columns):Column Non-Null Count Dtype --- ------ -------------- ----- 0 recipe_title 8009 non-null object 1 url 8009 non-null object 2 record_health 8009 non-null object 3 vote_count 8009 non-null int64 4 rating 8009 non-null float64 5 description 7994 non-null object 6 cuisine 7943 non-null object 7 course 7854 non-null object 8 diet 7858 non-null object 9 prep_time 7979 non-null object 10 cook_time 7979 non-null object 11 ingredients 7997 non-null object 12 instructions 8009 non-null object 13 author 8009 non-null object 14 tags 7930 non-null object 15 category 8009 non-null objectdtypes: float64(1), int64(1), object(14)memory usage: 1001.2+ KB

The above gives use:

  • 3 data types - float64(1), int64(1), object(14)
  • memory usage: 1001.2+ KB
  • information for the rows and the columns

By using the method df.shape() we can find the total number of rows and columns. Which has synonyms:

  • row - observation, record, trial
  • column - variable, feature, characteristic

Data insights from this step:

Insights:

  • dataset contains - 8009 rows and 16 columns
    • 8009 observations
    • 16 characteristics
  • 2 numeric columns
  • 2 which need conversion - prep_time and cook_time
    • convert string to int
    • 15 M -> 15

Step 4: Statistics - min, max, mean, percentile

We can get useful statistics for numeric and non numeric columns of this DataFrame.

4.1. Descriptive statistics for numeric columns

For numeric columns we can use:

df.describe()

which returns stats like:

  • mean
  • standard deviation
  • percentile
  • count
  • min and max
vote_countrating
count8009.0000008009.000000
mean2268.0044954.888621
std3683.1565700.077467
min15.0000003.175705
25%494.0000004.865031
50%1050.0000004.900553
75%2487.0000004.930000
max80628.0000005.000000

Insights:

  • mean value is close to 75%
  • no missing values
  • big difference between max and 75% percentile for vote_count - potential outliers

4.2. Pandas describe() for non numeric columns

For the rest of the columns we can use include='object':

df.describe(include='object')

the result is:

recipe_titleurlrecord_healthdescriptioncuisinecoursedietingredients
count80098009800979947943785478587997
unique80098009179897713107953
topRoasted Peppers And Mushroom Tortilla Pizza Recipehttps://www.archanaskitchen.com/roasted-peppers-and-mushroom-tortilla-pizza-recipegoodIf you like this recipe, try more recipes likeIndianLunchVegetarianLemons|Sugar|Salt|Red Chilli powder
freq11800961336192554783

Few columns seems to be more interesting than the rest:

Insights:

  • 4 categorical columns - category, cuisine, diet and course
  • 1 column with single value - record_health

Step 5: Inspect single column of DataFrame

df.diet.nunique()

returns the number of unique values in this column:

10
df.diet.unique()

will give us all unique values:

array(['Vegetarian', 'High Protein Vegetarian', 'Non Vegeterian', 'No Onion No Garlic (Sattvic)', 'High Protein Non Vegetarian', 'Diabetic Friendly', 'Eggetarian', 'Vegan', 'Gluten Free', nan, 'Sugar Free Diet'], dtype=object)
df.diet.value_counts(dropna=False)

returns the values and their frequency:

Vegetarian 5478High Protein Vegetarian 792Non Vegeterian 447Eggetarian 391Diabetic Friendly 296High Protein Non Vegetarian 225NaN 151No Onion No Garlic (Sattvic) 77Vegan 71Gluten Free 66Sugar Free Diet 15Name: diet, dtype: int64

Insights:

  • 10 unique values
  • categorical column
  • Vegetarian - is the most frequent value
  • 151 - missing values

Step 6: Data conversion in Pandas

Sometimes data needs to be processed for proper analysis. For example nested data to be expanded, string needs to be converted to number or date.

We'll see two examples of data processing required for further data analysis.

6.1 Convert String to numeric

To analyze columns cook_time and prep_time as numeric we can convert them by:

df['cook_time'] = df['cook_time'].str.replace(' M', '').fillna(0).astype(int)df['prep_time'] = df['prep_time'].str.replace(' M', '').fillna(0).astype(int)

Now we can apply visual techniques to analyze those columns for outliers - this is done in section 7.3 of this article.

6.2 Expand nested column

Next we will expand a column "ingredients" - in order to analyze ingredients for each recipe:

df['ingredients'].str.split('|', expand=True)

The result is new DataFrame with expanded values from the ingredient column:

from:

0 Tortillas|Extra Virgin Olive Oil|Garlic|Mozzar...1 Sesame (Gingelly) Oil|Mustard seeds (Rai/ Kadu...2 Extra Virgin Olive Oil|Pineapple|White onion|R...

to:

01234
0TortillasExtra Virgin Olive OilGarlicMozzarella cheeseRed Yellow or Green Bell Pepper (Capsicum)
1Sesame (Gingelly) OilMustard seeds (Rai/ Kadugu)Curry leavesGarlicPearl onions (Sambar Onions)
2Extra Virgin Olive OilPineappleWhite onionRed Yellow and Green Bell Peppers (Capsicum)Pickled Jalapenos
3Arhar dal (Split Toor Dal)Turmeric powder (Haldi)SaltDry Red ChilliesMustard seeds (Rai/ Kadugu)
4Rajma (Large Kidney Beans)Cashew nutsSultana RaisinsAsafoetida (hing)Cumin seeds (Jeera)

To find the most common 'ingredient' in the all recipes we can use the following steps:

  • split the column on separator "|"
    • expand=True - will expand data to DataFrame
  • melt() - transform DataFrame columns to rows - pairs of variable and value
  • count the values by .value_counts()
df['ingredients'].str.split('|', expand=True).melt()['value'].value_counts()

we found that salt is the most popular ingredient:

Salt 5698Sunflower Oil 2241Garlic 2227Turmeric powder (Haldi) 2221Onion 2012 ...नारीयल पानी 1नारीयल का दूध 1सिंघाड़े का आटा 1Black currant (dried) 1Corn chips 1Name: value, Length: 1823, dtype: int64

Step 7: Visualization for Data Analysis

Finally lets cover a few visualization techniques to find missing values, outliers and correlations.

7.1 Missing values

The first visualization technique is about finding missing values in Pandas DataFrame.

Use library Seaborn with heat map to plot missing values for the whole dataset:

  • df.isna() - returns new DataFrame with True/False values depending on the missing values in a cell
  • cmap = 'Greens' - is a color map. For more info check: Get color palette with matplotlib
sns.heatmap(df.isna(),cmap = 'Greens')

result is:

Exploratory Data Analysis Python and Pandas with Examples (3)

Next let's check the same visualization for the last 3000 rows:

sns.heatmap(df.tail(3000).isna(),cmap = 'Greens')

result:

Exploratory Data Analysis Python and Pandas with Examples (4)

From the result above we can conclude:

Insights:

  • data set has missing values in several columns
    • represented by darker green lines
  • cuisine, course and diet
    • plot suggest pattern for missing values of those columns

7.2 Correlations

Again will use Seaborn's heatmap in order to visualize correlations between two and more columns:

sns.heatmap(df.corr(),cmap='Greens',annot=False)

The first option is for the whole dataset shows - no correlation between the columns.

Exploratory Data Analysis Python and Pandas with Examples (5)

sns.heatmap(df[df.diet=='Vegan'].corr(),cmap='Greens',annot=True)

While the second one is showing correlations for rows with type "Vegan". We can see a correlation between columns 'prep_time' and 'cook_time'.

Exploratory Data Analysis Python and Pandas with Examples (6)

Insights:

  • no correlation of numerical columns for the whole DataFrame
  • correlation for - Vegan recipes - between prep_time and cook_time
  • darker color represent positive correlation
    • lighter - negative
  • to get values use annot=True

7.3 Detect outliers in Pandas DataFrame

Finally we will use boxplot to detect outliers:

cols = ['vote_count', 'rating', 'prep_time', 'cook_time']plt.figure(figsize=(20, 5))sns.boxplot(data=df[cols], orient='h')

The picture below suggest about outliers in the numeric columns

Exploratory Data Analysis Python and Pandas with Examples (7)

Insights:

  • vote_count - suggest outliers
  • cook_time - has possible outliers
  • rating - doesn't indicate outliers

7.4 Explain Seaborn boxplot and outliers

In order to get a better understanding we will work only with two columns: 'prep_time', 'cook_time' and the first 18 records.

The stats for them are:

df[['prep_time', 'cook_time']].head(18).describe()
prep_timecook_time
count18.00000018.000000
mean12.22222222.388889
std4.60888617.091928
min5.0000000.000000
25%10.00000011.250000
50%10.00000020.000000
75%15.00000030.000000
max20.00000060.000000

The picture below illustrates how to analyze a dataset for outliers with boxplot.

Exploratory Data Analysis Python and Pandas with Examples (8)

Boxplot has 3 main components:

  • box (Interquartile Range - IQR)
    • q1 - 25%
    • q2 - 50% or median
    • q3 - 75%
  • whiskers
    • min
    • max
  • outliers
    • points beyond the whiskers

We can notice that sometimes the whiskers don't match the min/max. They are calculated based on formulas. Upper and low whiskers are calculated as 1.5 multiplied on the interquartile range - IQR(the box).

Definition
An outlier(extreme) is a value that lies in the extremes of a data Series(on either end). They can affect overall observation and insights.

In the next article we will cover more techniques and several Python libraries for exploratory data analysis.

Conclusion

In this article, we covered the basics of Exploratory Data Analysis and discussed most of the common methods which every data scientist should know while working with datasets. Through practical examples, we learned how to explore new datasets.

We started by initial data analysis and first glance of data. After that, we made data conversion and preprocessing for further analysis. Finally, we saw multiple visual techniques in order to find missing values, outliers and correlations.

Exploratory Data Analysis Python and Pandas with Examples (2024)

References

Top Articles
Latest Posts
Article information

Author: Neely Ledner

Last Updated:

Views: 6048

Rating: 4.1 / 5 (42 voted)

Reviews: 81% of readers found this page helpful

Author information

Name: Neely Ledner

Birthday: 1998-06-09

Address: 443 Barrows Terrace, New Jodyberg, CO 57462-5329

Phone: +2433516856029

Job: Central Legal Facilitator

Hobby: Backpacking, Jogging, Magic, Driving, Macrame, Embroidery, Foraging

Introduction: My name is Neely Ledner, I am a bright, determined, beautiful, adventurous, adventurous, spotless, calm person who loves writing and wants to share my knowledge and understanding with you.