Tidy Data in Python – First Step in Data Science and Machine Learning

Most of the Data Science / Machine Learning projects follow the Pareto principle where we spend almost 80% of the time in data preparation and remaining 20% in choosing and training the appropriate ML model. Mostly, the datasets we get to create Machine Learning models are messy datasets and cannot be fitted into the model directly. We need to perform some data cleaning steps in order to get a dataset which then can be fitted into the model. We need to make sure that the data we are inputting into the model is a tidy data. Indeed, it is the first step in a Machine Learning / Data Science project. We may need to repeat the data cleaning process many times as we face new challenges and problems while cleaning the data. Data cleaning is one of the most important and time taking process a Data Scientist performs before performing any EDA (Exploratory Data Analysis) or statistical analysis on the data.

What is Tidy Data

Tidy data is the data which is obtained as as an outcome of processes applied to clean the data in such a way so that the data can be visualized and used in a model in an easy way. Tidy data norms helps us in data analysis process by defining some guidelines which we need to follow while performing the data cleaning operations. It provides a standard way to organize the data values within a dataset and formalizes the way we define the shape of the dataset. Also, the tidy data makes the data manipulation process easier. As in database normalization forms(1NF, 2NF, 3NF, & BCNF), tidy data principles have some predefined rules. Let’s have a look at the tidy data principles.

Tidy Data Principles

The tidy data principles are:

  1. Each column represents a separate variable in the dataset
  2. Each row represents an individual observation in the dataset
  3. Each type of observational unit forms a table

These rules are equivalent to the database third normal form (3NF).

Let’s have a look at some real life dataset examples to understand the tidy data principles in more detail.

1. Each column represents a separate variable in the dataset

Assume that we have a dataset which stores the day-wise ice-cream sale data for each employee. Click here to download the sample csv file.

Day-wise Ice-cream sale details
Day-wise Ice-cream sale details

This data looks better for reporting purpose, however, this format is not good for analysis purpose. In this data, each day (Mon to Sun) represents an individual column as a separate variable. However, as per the first rule of the Tidy Data principles (Each column represents a separate variable in the dataset), we need to transpose the day names into a single column as they are representing a single variable – the week day name.

We can use pandas.melt method to convert the day name columns into a single row values which will help us to convert this messy dataset into the tidy dataset. We can use this python script.

#Line 1 - Import pandas library under alias pd
import pandas as pd

#Line 2 - Read the existing csv file in a dataframe using pandas read_csv method
df = pd.read_csv(r'D:\\SQL Server Tutorial\\Day wise Ice-cream sale detail (in dollars).csv')

#Line 3 - Melt the dataframe df to put the day names into rows
dfMelted = pd.melt(frame = df, 
        id_vars = 'EmployeeName', 
        value_vars = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 
        var_name = 'WeekDay', value_name = 'AmountInDollars')

#Line 4 - Print the melted dataframe
print(dfMelted)

Output:

Tidy Data output (Top 10 rows only)
Tidy Data output (Top 10 rows only)

Let’s discuss each line of code in detail:

#Line 1

import pandas as pd

First, we are using an import command to import the pandas library under the standard alias pd.

#Line 2

df = pd.read_csv(r'D:\\Day wise Ice-cream sale detail (in dollars).csv')

In this line, we are using the read_csv method of pandas library to read the existing csv file in a dataframe.

#Line 3

dfMelted = pd.melt(frame = df, 
        id_vars = 'EmployeeName', 
        value_vars = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 
        var_name = 'WeekDay', value_name = 'AmountInDollars')

In this line, we are using the pandas melt method to transpose the day name columns into a single row. We are passing EmployeeName column in id_vars params to make it identifier. Also, we are passing all the column names which need to be unpivoted as row values in value_vars param. Using var_name and value_name params, we are supplying meaningful names to the unpivoted column and the result column.

#Line 4

print(dfMelted)

In this line, we are printing the melted data frame which is a tidy dataset now.

2. Each row represents an individual observation in the dataset

Let’s assume that we have weather data of few cities with the minimum and maximum temperature details for each date. Click here to download the sample csv file.

Date-wise weather data
Date-wise weather data

In this dataset, we have two issues:

  1. The date values are on columns which need to be moved in a row as they are representing a single variable – date.
  2. The min and max temperature values which are placed into rows, need to be transposed as individual columns as they are representing two different variables – minimum and maximum temepartures.

To get this done, we can use pandas melt method followed by a pivot or pivot_table method. To move the date values into rows, we can use melt method and then we can apply the pandas pivot or pivot_table method to move the min and max row values as individual columns in the dataset.

This is the python script we are using to convert the weather dataset into a tidy dataset.

#Line 1 - Import pandas library under alias pd
import pandas as pd

#Line 2 - Read the existing csv file in a dataframe using pandas read_csv method
df = pd.read_csv(r'D:\\Date wise weather data.csv')

#Line 3 - Melt the dataframe df to move the date values in rows
dfMelted = pd.melt(frame = df, 
        id_vars = ['CityName', 'VarType'], 
        value_vars = ['01-08-2018', '02-08-2018', '03-08-2018', '04-08-2018', '05-08-2018', '06-08-2018', '07-08-2018'], 
        var_name = 'Date', value_name = 'Temperature')
        
#Line 4 - Move the temperature values as individual columns using pivot_table method
dfPivoted = dfMelted.pivot_table(
                index = ['CityName', 'Date'], 
                columns = 'VarType', 
                values = 'Temperature')

#Line 5 - Reset the index columns as regular dataframe columns
dfPivoted.reset_index(inplace = True)

#Line 6 - Set the column list name to empty
dfPivoted.columns.name = ''

#Line 7 - Print the melted dataframe
print(dfPivoted)

Output:

Tidy Data output(Top 10 sample rows)
Tidy Data output(Top 10 sample rows)

Let’s discuss each line of code in detail.

#Line 1

import pandas as pd

First, we are using an import command to import the pandas library under alias pd.

#Line 2

df = pd.read_csv(r'D:\\Date wise weather data.csv')

In this line, we are using the read_csv method of pandas library to read the existing csv file in a dataframe.

#Line 3

dfMelted = pd.melt(frame = df, 
        id_vars = ['CityName', 'VarType'], 
        value_vars = ['01-08-2018', '02-08-2018', '03-08-2018', '04-08-2018', '05-08-2018', '06-08-2018', '07-08-2018'], 
        var_name = 'Date', value_name = 'Temperature')

In this line, we are using pandas melt method to transpose the date values into a single row. We are using CityName and VarType in id_vars params to make it identifier. Also, we are passing all the column names which need to be unpivoted as row values in value_vars param. Using var_name and value_name params, we are supplying meaningful names to the unpivoted column and value column.

#Line 4

dfPivoted = dfMelted.pivot_table(
                index = ['CityName', 'Date'], 
                columns = 'VarType', 
                values = 'Temperature')

In this line, we are pivoting the temperature column values to move the row values as individual columns in the data frame. We can use the python method chaining technique to perform melt and pivot in a single line of code also. However, here, we are using a new data frame to hold the pivoted output data for easy understanding.

As we don’t need to transpose the CityName and Date columns, we are passing these columns in a list in index param. The VarType column is passed to columns param which needs to be pivoted. Finally, we have used Temperature column as value column in values param.

#Line 5

dfPivoted.reset_index(inplace = True)

As we have used CityName and Date columns as index column, both these columns will be used as dataframe index column. Also, this will be a multilevel index column. If we want to move these index columns as regular columns, we can use pandas reset_index method (without any index values) to reset the indexes to the regular range index. We have set the inplace param value to True to transform the dataframe in-place rather than creating a new dataframe.

#Line 6

dfPivoted.columns.name = ''

Here, we are resetting the column name list to empty.

#Line 7

print(dfPivoted)

In this line, we are printing the transformed dataframe which satisfies the tidy data principles.

3. Each type of observational unit forms a table

Sometimes, a single type of observational unit spread over multiple files or tables. However, as per the tidy data principle – each type of observational unit forms a table, we need to merge these files into a single table.

Suppose, we have two csv files which are as this. Download these csv files from here.

Single observational unit spread over multiple files
Single observational unit spread over multiple files

Here, we can see that both of these files contains the details about a single observation. We can use pandas concat/merge/join methods to merge these two datasets into a single dataset to convert it into a tidy dataset.

We can use this python code.

#Line 1 - Import pandas library under alias pd
import pandas as pd

#Line 2 - Read the ice-cream sale detail csv file using pandas read_csv method
dfIceCreamSale = pd.read_csv(r'D:\\Date-wise ice-cream sale detail.csv')

#Line 3 - Read the temperature detail csv file using pandas read_csv method
dfTempDetail = pd.read_csv(r'D:\\Date-wise temperature detail.csv')

#Line 4 - Merge both these dataframes into a single dataframe
dfMain = pd.merge(left = dfIceCreamSale, right = dfTempDetail, on = 'Date')

#Line 5 - Print the dataframe
print(dfMain)

Output:

Tidy Data output

Let’s discuss each line in more detail.

#Line 1

import pandas as pd

First, we are using an import command to import the pandas library under alias pd.

#Line 2

dfIceCreamSale = pd.read_csv(r'D:\\Date-wise ice-cream sale detail.csv')

Here, we are reading the date-wise ice-cream sale detail csv file into a dataframe named dfIceCreamSale.

#Line 3

dfTempDetail = pd.read_csv(r'D:\\Date-wise temperature detail.csv')

Here, we are reading the date-wise temperature detail csv file into a dataframe named dfTempDetail.

#Line 4

dfMain = pd.merge(left = dfIceCreamSale, right = dfTempDetail, on = 'Date')

Here, we are using pandas merge method to join these two dataframes into a single dataframe. We have defined the left and right dataframes using left and right params respectively. Also, we have defined the on clause to join these datasets based on Date column. By default, merge performs the inner join. However, we can use left, right or full join, if needed.

#Line 5

print(dfMain)

Here, we are printing the dataframe using print method.

Thanks for the reading. Please share your input in the comment section.

Rate This
[Total: 1 Average: 5]

1 thought on “Tidy Data in Python – First Step in Data Science and Machine Learning”

  1. Pingback: Building first Machine Learning model using Logistic Regression in Python - Step by Step - SQLRelease

Leave a Comment

Your email address will not be published. Required fields are marked *


The reCAPTCHA verification period has expired. Please reload the page.

This site uses Akismet to reduce spam. Learn how your comment data is processed.