Python use case – Resampling time series data (Upsampling and downsampling) – SQL Server 2017


Resampling time series data in SQL Server using Python’s pandas library

In this post, we are going to learn how we can use the power of Python in SQL Server 2017 to resample time series data using Python’s pandas library. Sometimes, we get the sample data (observations) at a different frequency (higher or lower) than the required frequency level. In such kind of scenarios, we need to modify the frequency of the given samples as per the frequency of the required outcome. Modifying the frequency of time series data using T-SQL query becomes a tedious task especially when we need to perform upsampling as we need to generate more rows than what we have in the sample dataset. The Python’s pandas module has in-built capabilities for frequency conversion. With the help of pandas resample method, we can increase or decrease the time series observation frequencies with only few lines of code.

Time Series Data Resampling

Time series data resampling is the process of increasing or decreasing the frequency of the time series data using interpolation schemes or by applying statistical methods. When we apply resampling method on time series data, it generates the fine grained (up or down) sampling distribution on the basis of actual sample. For example, if we have an observation at month level frequency and want to generate samples for each day, we can use resampling techniques. Time series data can be resampled in two ways:

  1. Upsampling and
  2. Downsampling

To understand the upsampling and downsampling, let’s have a look at this image.

Upsampling and downsampling

Upsampling and downsampling

Let’s discuss each of the time series resampling methods in more detail.

Upsampling

Upsampling is a process where we generate observations at more granular level than the current observation frequency. In upsampling, we increase the date-time frequency of the given sample. For example, changing the frequency from:

  1. Yearly to quarterly
  2. Quarterly to monthly
  3. Monthly to weekly
  4. Weekly to daily, and so on.

We can upsample the data from any upper level frequency to a more fine grained frequency level. As we move to a more granular level frequency, we get some empty observations also. To fill these empty observations with meaningful values, we can use interpolation techniques like linear interpolation, forward fill, backward fill, and etc.

Upsampling Example

Suppose, we have a table named tbl_sale having saledate and amount columns which stores:

saledate – Date of sale and,

amount – Amount of sale for a given date

tbl_sale - Sample Data

tbl_sale – Sample Data

To create the sample table with dummy data, use this script.

IF OBJECT_ID('dbo.tbl_sale') IS NOT NULL
	DROP TABLE dbo.tbl_sale
GO
CREATE TABLE dbo.tbl_sale
(
	saledate DATETIME NOT NULL,
	amount NUMERIC(18, 2) NOT NULL
)
GO

INSERT INTO dbo.tbl_sale
(saledate, amount)
VALUES('20180107', 553.25),
('20180114', 412.12),
('20180121', 550.22),
('20180128', 222.45),
('20180204', 521.10)
GO

Weekly to daily frequency conversion

Let’s assume that we need to create daily samples from the given weekly samples. We can use pandas resample method to change the frequency from weekly to daily. Also, we can apply an interpolation scheme to fill the empty observations generated as a result of moving the frequency to a more granular level.

We can use the below code to generate the daily samples from the weekly data.

EXEC sp_execute_external_script
@language = N'Python'
,@script =
N'
#Line 1 & 2 - Import pandas and datetime modules
import pandas as pd
from datetime import datetime 

#Line 3 - Convert datatype of saledate column from string to datetime
df["saledate"] = pd.to_datetime(df["saledate"])

#Line 4 - Set saledate column as index column of the dataframe and remove the duplicate column
df = df.set_index("saledate", drop = True)

#Line 5 - Resample the data at daily basis frequency and fill the empty observations using lineat interpolation
df = df.resample("D").interpolate("linear")

#Line 6 - Now, reset the index to range index (For SQL Server compatibility)
df = df.reset_index()

#Line 7 - Change the datatype of saledate column from datetime to string (For SQL Server compatibility)
df["saledate"] = df["saledate"].astype(str)

#Line 8 - Assign the dataframe df to the output dataframe OutputDataSet (Specific to SQL Server)
OutputDataSet = df
'
,@input_data_1 = N'SELECT 
					CAST(saledate AS VARCHAR(50)) AS saledate, CAST(amount as FLOAT) AS amount 
					FROM dbo.tbl_sale'
,@input_data_1_name = N'df'
WITH RESULT SETS(("Sale Date" DATETIME, "Amount" DECIMAL(18,2)))
GO

Output

Upsampled data

Upsampled data (Truncated to fit)

Here, we can see that we have samples for each date starting from the first date of the sample in the dataset (07-Jan-2018) to the last date in the dataset (04-Feb-2018). As we have used linear interpolation, the newly generated observations have been filled with the equally distributed values lying between the previous and the next week amounts.

Now, let’s discuss each line of code in detail.

#Line 1 and 2

In line 1 and line 2, we are importing the pandas and datetime modules being used in the python script.

#Line 3

As of now, SQL Server does not support datetime data types in input datasets. That is why we have converted the saledate column as string values while passing it to the external script. However, we need to convert the saledate values as datetime data type in order to apply resample method on this column.

#Line 4

In this line, we move the saledate column to index column of the dataframe and remove the duplicate column from the regular columns of the dataframe.

#Line 5

In this line, we have used the dataframe’s resample method to generate daily samples from the weekly samples. Also, we have used linear interpolation technique to fill the newly generated empty rows.

#Line 6

As of now, SQL Server does not support dataframe’s index column and that is why we need to move the saledate column from the index column to a regular dataframe column.

#Line 7

As of now, SQL Server does not support datetime data type in output dataset. That is why we need to convert the saledate column into string values before returning it back to the SQL Server.

#Line 8

Finally, we are assigning the dataframe df to an output dataframe named OutputDataSet.

Downsampling – Time series data

Downsampling is a process where we generate observations at more aggregate level than the current observation frequency. In downsampling, we decrease the date-time frequency of the given sample. For example, changing the frequency from:

  1. Daily to weekly
  2. Weekly to monthly
  3. Monthly to quarterly
  4. Quarterly to yearly, and so on.

We can use aggregation methods to aggregate the data at a lower level. We can apply SUM, MIN, MAX, Count, MEAN, MEDIAN, MODE, and other statistical techniques to aggregate the observations.

Downsampling Examples

Assume that we get daily samples and we need to generate weekly samples from it. The output data which we have generated above in this blog using upsampling method, have been pushed into a table named tbl_sale_daily. Now, let’s change the frequency from daily to weekly using resample method.

Daily to weekly frequency conversion

To generate the weekly samples from the daily samples, we can use below code.

EXEC sp_execute_external_script
@language = N'Python'
,@script =
N'
#Line 1 & 2 - Import pandas and datetime modules
import pandas as pd
from datetime import datetime 

#Line 3 - Convert datatype of saledate column from string to datetime
df["saledate"] = pd.to_datetime(df["saledate"])

#Line 4 - Set saledate column as index column of the dataframe and remove the duplicate column
df = df.set_index("saledate", drop = True)

#Line 5 - Resample the data at daily basis frequency and fill the empty observations using lineat interpolation
df = df.resample("W").mean()

#Line 6 - Now, reset the index to range index (For SQL Server compatibility)
df = df.reset_index()

#Line 7 - Change the datatype of saledate column from datetime to string (For SQL Server compatibility)
df["saledate"] = df["saledate"].astype(str)

#Line 8 - Assign the dataframe df to the output dataframe OutputDataSet (Specific to SQL Server)
OutputDataSet = df 
'
,@input_data_1 = N'SELECT 
					CAST(saledate AS VARCHAR(50)) AS saledate, CAST(amount as FLOAT) AS amount 
					FROM dbo.tbl_sale_daily'
,@input_data_1_name = N'df'
WITH RESULT SETS(("Sale Date" DATETIME, "Amount" DECIMAL(18,2)))
GO

Output

Weekly sample data

Weekly sample data

In above image, we can see that, we get the original dataset back when we changed the frequency of the dataset from daily to weekly using downsampling method. We have used similar Python code as we have used in upsampling while performing the downsampling. The main difference is that in upsampling we have used interpolation technique to fill the newly created gaps in the datetime frequency. However, we have used mean() method to aggregate the data on weekly basis.

Thanks for the reading. Please share your inputs in comments.

Rate This
[Total: 0    Average: 0/5]

Gopal Krishna Ranjan

About Gopal Krishna Ranjan

Gopal has 8 years of industry experience in Software development. He has a head down experience in Data Science, Database, Data Warehouse, Big Data and cloud technologies and has implemented end to end solutions. He has extensively worked on SQL Server, Python, Hadoop, Hive, Spark, Azure, Machine Learning, and MSBI (SSAS, SSIS, and SSRS). He also has good experience in windows and web application development using ASP.Net and C#.

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.