Exploratory Data Analysis (EDA) using Python – Second step in Data Science and Machine Learning


In the previous post, “Tidy Data in Python – First Step in Data Science and Machine Learning”, we discussed the importance of tidy data and its principles. In a Machine Learning project, once we have a tidy dataset in place, it is always recommended to perform Exploratory Data Analysis (EDA) on the underlying data before fitting a Machine Learning model. Let’s start by understanding the importance of EDA and some basic, very useful EDA techniques.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis, or EDA, is the process of organizing, plotting, and summarizing data to find trends, patterns, and outliers using statistical and visual methods. It takes input data in a tabular format and represents it graphically, which makes it easier for humans to interpret. It is an important step in a Machine Learning/Data Science project and should be performed before creating a Machine Learning or statistical model. Exploratory Data Analysis also helps us make the assumptions required for statistical hypothesis testing or for fitting a Machine Learning model.

Fitting data to a model without doing EDA is like feeding it into a black box and waiting for the result. Exploratory Data Analysis helps us extract critical information at an early stage and reduces the chance of getting unexpected outcomes from the applied ML model.

John Tukey, the American mathematician who coined the term exploratory data analysis, said, “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.”

Exploratory Data Analysis is useful in:

  1. Understanding the underlying data at a deeper level.
  2. Bringing into focus important aspects of the data that are not easy to find using summary statistics such as mean, median, mode, correlation, covariance, variance, and standard deviation.
  3. Identifying important features in the dataset.
  4. Finding the relationship between variables.
  5. Identifying erroneous values and outliers in the dataset.
  6. Performing a hypothesis test on the dataset.

Importance of EDA

Let’s understand the importance of EDA with the help of Anscombe’s quartet. In 1973, statistician Francis Anscombe published four fictitious datasets, each consisting of 11 pairs of x and y values. When we compute summary statistics (mean, median, variance, standard deviation, correlation, etc.) on these datasets, we get nearly identical values for all four. Based on the summary statistics alone, we might assume that these datasets are identical. However, when we perform EDA on them, we find that they are quite different. The datasets are as follows:

Anscombe’s quartet

When we compute summary statistics for all four datasets, we find identical values:

Anscombe’s quartet – Summary Statistics

Based on the above summary statistics, we might conclude that these datasets are identical: they have almost the same values for mean, variance, and correlation, and the equation of the regression line is the same for all four. However, let’s perform graphical EDA before making any assumption based on numerical statistical measures alone.

To demonstrate graphical EDA using Anscombe’s data, we first need to import the required packages. Then we can create a class named anscombe_data with a method named get_anscombe_data(). We will be using this class for demonstration purposes throughout this post.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

class anscombe_data:
    def __init__(self):
        # the x values are identical for datasets 1, 2, and 3
        self.x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
        self.y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
        self.x2 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
        self.y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74])
        self.x3 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
        self.y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
        # dataset 4 has a single outlying x value
        self.x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
        self.y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89])

    def get_anscombe_data(self):
        return self.x1, self.y1, self.x2, self.y2, self.x3, self.y3, self.x4, self.y4
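As an aside, since the previous post advocated tidy data: the four datasets can also be arranged in a single tidy pandas DataFrame, one row per observation with a dataset label. Below is a minimal sketch using the same values as above; the column names (dataset, x, y) and the Roman-numeral labels are our own choice for illustration.

```python
import pandas as pd

# the x values shared by the first three datasets
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
data = {
    'I':   (x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]),
    'III': (x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8] * 7 + [19] + [8] * 3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}

# one row per (dataset, x, y) observation -- the tidy layout
tidy = pd.concat(
    pd.DataFrame({'dataset': name, 'x': xs, 'y': ys})
    for name, (xs, ys) in data.items()
).reset_index(drop=True)

print(tidy.shape)  # (44, 3)
```

A tidy frame like this also plays nicely with seaborn, which accepts long-form data directly.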

Once we have created the class “anscombe_data” and imported the required packages, we can use the following code to compute the statistical summary of these datasets.

def get_statistical_measures():
    x1, y1, x2, y2, x3, y3, x4, y4 = anscombe_data().get_anscombe_data()
    datasets = [(x1, y1), (x2, y2), (x3, y3), (x4, y4)]
    print('-' * 100)
    for i, (x, y) in enumerate(datasets, start = 1):
        print('Mean of x{0} = {1} and of y{0} = {2}'.format(i, np.mean(x), np.mean(y)))
    print('-' * 100)
    print('')
    print('-' * 100)
    for i, (x, y) in enumerate(datasets, start = 1):
        print('Sample variance of x{0} = {1} and of y{0} = {2}'.format(i, np.var(x, ddof = 1), np.var(y, ddof = 1)))
    print('-' * 100)
    print('')
    print('-' * 100)
    for i, (x, y) in enumerate(datasets, start = 1):
        print('Correlation between x{0} and y{0} = {1}'.format(i, np.corrcoef(x, y)[0, 1]))
    print('-' * 100)
    print('')
    print('-' * 100)
    for i, (x, y) in enumerate(datasets, start = 1):
        slope, intercept = np.round(np.polyfit(x, y, deg = 1), 2)
        print('Equation of linear regression line = {0} x + {1} (rounded off to 2 decimal places)'.format(slope, intercept))
    print('-' * 100)
get_statistical_measures()

Basic Exploratory Data Analysis Techniques in Python

Let’s learn some basic exploratory data analysis techniques that we can perform in Python, using Anscombe’s datasets.

Scatter plot

A scatter plot displays two related variables on the x and y axes, typically treating x as the independent and y as the dependent variable. Since each of the datasets above contains two correlated variables (x and y), a scatter plot is a natural choice here. It helps us see how x and y change together.

Scatter plot – Anscombe’s quartet

In the above scatter plot, we can see that the first dataset (x1 and y1) can be modeled by a line, so a linear regression model is a reasonable choice. The second dataset (x2 and y2) is clearly nonlinear, and we need to choose a different model. The third dataset (x3 and y3) can also be modeled by a line, but the outlier has a significant impact on the slope and intercept of the line, which should be studied before fitting the model. The fourth dataset (x4 and y4) may or may not have a linear relationship; we would need more data values for x and y to draw any conclusion.
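To quantify how strongly the single outlier in the third dataset drives the fit, we can refit the line after dropping it. This is a small sketch with the arrays hard-coded from the class above; dropping a point this way is for illustration only, not a recommendation to silently discard outliers.

```python
import numpy as np

x3 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

# fit with all points, then again without the outlier (x = 13, y = 12.74)
slope_all, intercept_all = np.polyfit(x3, y3, deg = 1)
mask = y3 != 12.74
slope_clean, intercept_clean = np.polyfit(x3[mask], y3[mask], deg = 1)

print(np.round([slope_all, intercept_all], 2))      # roughly [0.5, 3.0]
print(np.round([slope_clean, intercept_clean], 2))  # slope drops to about 0.35
```

A single point shifts the slope by roughly 30%, which is exactly the kind of sensitivity a scatter plot reveals and summary statistics hide.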

To create the above scatter plot using Anscombe’s data in Python, we can use the code below, once the class (anscombe_data) has been created and the required packages imported.

def draw_plot_scatter():
    x1, y1, x2, y2, x3, y3, x4, y4 = anscombe_data().get_anscombe_data()
    datasets = [(x1, y1), (x2, y2), (x3, y3), (x4, y4)]
    plt.figure(1)
    for i, (x, y) in enumerate(datasets, start = 1):
        plt.subplot(2, 2, i)
        plt.plot(x, y, linestyle = 'none', marker = '.')
        plt.xlabel('x{0}'.format(i))
        plt.ylabel('y{0}'.format(i))
        # overlay the least-squares regression line
        a, b = np.polyfit(x, y, deg = 1)
        minmaxX = np.array([0, 20])
        plt.plot(minmaxX, (a * minmaxX) + b)
        plt.xticks(range(4, 20, 2))
        plt.yticks(np.arange(4, 14, 2))
    plt.show()
draw_plot_scatter()

Histogram

A histogram is generally used to plot a single variable: the x axis is divided into equal-sized bins, and the y axis shows the count of values falling inside each bin range, so the height of a bar shows how many values fall in that bin. We can set the bins manually. If the bins are not provided, the histogram function uses an automatic binning algorithm that returns bins of uniform width covering all the elements in the data, chosen to reveal the underlying shape of the distribution. However, a histogram created from the same data can look different with a different number of bins, an effect known as binning bias. Let’s have a look at the histograms of the x and y data for all of Anscombe’s datasets.
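Binning bias is easy to demonstrate numerically. In this minimal sketch, np.histogram (the function that backs plt.hist) counts the same 11 y-values into different numbers of bins, and the resulting shape changes with each choice:

```python
import numpy as np

y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# the same data binned three ways gives three different-looking histograms
for nbins in (3, 5, 10):
    counts, edges = np.histogram(y1, bins = nbins)
    print(nbins, counts.tolist())
```

All three runs count the same 11 values, yet the bar heights, and hence the apparent shape of the distribution, differ.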

Histogram – Anscombe’s quartet

In the above histograms, we can see that the values of x in datasets 1, 2, and 3 are the same, whereas dataset 4 has different values of x. The histogram of the y values is different for every dataset.

To create the histograms for the x and y values using Anscombe’s data, we can use the code below, once the class (anscombe_data) has been created and the required packages imported.

def draw_plot_hist():
    x1, y1, x2, y2, x3, y3, x4, y4 = anscombe_data().get_anscombe_data()
    sns.set()
    binedges = [0, 5, 10, 15, 20]  # explicit bin edges (not a bin count)
    plt.figure(1)
    for i, x in enumerate([x1, x2, x3, x4], start = 1):
        plt.subplot(2, 2, i)
        _ = plt.hist(x, bins = binedges)

    plt.figure(2)
    for i, y in enumerate([y1, y2, y3, y4], start = 1):
        plt.subplot(2, 2, i)
        _ = plt.hist(y, bins = binedges)
    plt.show()
draw_plot_hist()

Bee swarm plot

A histogram can be affected by binning bias if the number of bins is not chosen well. Also, instead of displaying the actual data points, it sweeps them into bins. If we want to plot individual data points, we can use a bee swarm plot, which displays every data point of a single variable. Let’s have a look at the swarm plot of the x and y values for all of Anscombe’s datasets.

Bee swarm plot – Anscombe’s quartet

The swarm plot clearly shows that the values of x are the same for datasets 1, 2, and 3, whereas dataset 4 has different values of x. Also, the values of y are different for each dataset.

To create the bee swarm plot for the x and y values using Anscombe’s data, we can use the code below, once the class (anscombe_data) has been created and the required packages imported.

def draw_plot_swarm():
    x1, y1, x2, y2, x3, y3, x4, y4 = anscombe_data().get_anscombe_data()
    plt.figure(1)
    xvals1 = np.concatenate([(['x1'] * len(x1)), (['x2'] * len(x2)), (['x3'] * len(x3)), (['x4'] * len(x4))])
    yvals1 = np.concatenate([x1, x2, x3, x4])
    sns.swarmplot(x = xvals1, y = yvals1)

    plt.figure(2)
    xvals2 = np.concatenate([(['y1'] * len(y1)), (['y2'] * len(y2)), (['y3'] * len(y3)), (['y4'] * len(y4))])
    yvals2 = np.concatenate([y1, y2, y3, y4])
    sns.swarmplot(x = xvals2, y = yvals2)
    plt.show()
draw_plot_swarm()

Box plot

The bee swarm plot is very useful, but as the dataset grows, it becomes cluttered and the data points start overlapping each other. To summarize a larger distribution compactly, we can use a box plot instead. A box plot represents the percentiles of numerical data graphically. The box spans the middle 50% of the data, known as the inter-quartile range (IQR): its lower edge is the 25th percentile, its upper edge is the 75th percentile, and the line inside it marks the median. The whiskers extend up to 1.5 times the IQR beyond the box, or to the extent of the data, whichever is less. Points beyond the whiskers are drawn as outliers. Let’s have a look at the box plot of the x and y values for each of Anscombe’s datasets.
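The box-plot anatomy described above can be checked numerically with np.percentile. A minimal sketch using the y3 array (which contains the outlier 12.74):

```python
import numpy as np

y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

q1, median, q3 = np.percentile(y3, [25, 50, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # points above this are drawn as outliers

print(q1, median, q3)        # the three quartiles the box is built from
print(y3[y3 > upper_fence])  # the point the box plot flags as an outlier
```

The value 12.74 lies above the upper fence, which is why it appears as an isolated point beyond the whisker in the y3 box plot.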

Box plot – Anscombe’s quartet

The box plot clearly shows that the values of x are the same for datasets 1, 2, and 3, whereas dataset 4 has different values of x. Also, the values of y are different for each dataset.

To create the box plot for the x and y values using Anscombe’s data, we can use the code below, once the class (anscombe_data) has been created and the required packages imported.

def draw_plot_box():
    x1, y1, x2, y2, x3, y3, x4, y4 = anscombe_data().get_anscombe_data()
    plt.figure(1)
    xvals1 = np.concatenate([(['x1'] * len(x1)), (['x2'] * len(x2)), (['x3'] * len(x3)), (['x4'] * len(x4))])
    yvals1 = np.concatenate([x1, x2, x3, x4])
    sns.boxplot(x = xvals1, y = yvals1)

    plt.figure(2)
    xvals2 = np.concatenate([(['y1'] * len(y1)), (['y2'] * len(y2)), (['y3'] * len(y3)), (['y4'] * len(y4))])
    yvals2 = np.concatenate([y1, y2, y3, y4])
    sns.boxplot(x = xvals2, y = yvals2)
    plt.show()
draw_plot_box()

Violin plot

A violin plot is similar to a box plot and displays the numerical values of a single variable graphically. In addition, it shows the probability density of the data at different values. Let’s have a look at the violin plot of the x and y values for each of Anscombe’s datasets.
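The density curve that gives the violin its shape is a kernel density estimate (KDE). Below is a minimal, NumPy-only sketch of a Gaussian KDE; the bandwidth of 0.8 is an arbitrary choice for illustration (seaborn selects one automatically):

```python
import numpy as np

def gaussian_kde(values, grid, bandwidth = 0.8):
    # place a small Gaussian bump on every data point and average them
    diffs = (grid[:, None] - values[None, :]) / bandwidth
    bumps = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.sum(axis = 1) / len(values)

y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
grid = np.linspace(2, 13, 200)
density = gaussian_kde(y1, grid)

# a density should integrate to roughly 1 over a grid covering the data
print(round(float(density.sum() * (grid[1] - grid[0])), 2))
```

Mirroring this curve around a vertical axis and attaching the quartile markers gives, in essence, the violin that sns.violinplot draws.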

Violin plot – Anscombe’s quartet

The violin plot clearly shows that the values of x are the same for datasets 1, 2, and 3, whereas dataset 4 has different values of x. Also, the values of y are different for each dataset.

To create the violin plot for the x and y values using Anscombe’s data, we can use the code below, once the class (anscombe_data) has been created and the required packages imported.

def draw_plot_violin():
    x1, y1, x2, y2, x3, y3, x4, y4 = anscombe_data().get_anscombe_data()
    plt.figure(1)
    xvals1 = np.concatenate([(['x1'] * len(x1)), (['x2'] * len(x2)), (['x3'] * len(x3)), (['x4'] * len(x4))])
    yvals1 = np.concatenate([x1, x2, x3, x4])
    sns.violinplot(x = xvals1, y = yvals1)

    plt.figure(2)
    xvals2 = np.concatenate([(['y1'] * len(y1)), (['y2'] * len(y2)), (['y3'] * len(y3)), (['y4'] * len(y4))])
    yvals2 = np.concatenate([y1, y2, y3, y4])
    sns.violinplot(x = xvals2, y = yvals2)
    plt.show()
draw_plot_violin()

Thanks for reading. Please share your inputs in the comments.


Gopal Krishna Ranjan
