Python use case – Map unequal comma separated values from two columns – SQL Server 2017


We know that Microsoft has integrated Python in SQL Server 2017 to enable rich data analytics capabilities within the database itself. Python is one of the most powerful languages which provides lots of built-in libraries for advanced data analytics and transformations. We can use Python for almost everything from website development to robotics and Data Science. In SQL Server 2017, Python can be used primarily for Machine learning purposes but it is not limited to that only. We can also use Python for complex data transformations and analysis which might be a bit tedious and complex while doing the same using T-SQL in SQL Server.

In this post, we will be exploring an use case example of Python for data transformation in SQL Server 2017. If you want to read more about Python and how to use it in SQL Server, you can visit my previous blog post “Why Python and how to use it in SQL Server 2017“. If you want to explore new features of SQL Server 2017, visit this post “SQL Server 2017 New Features“. To know, why you should learn Python, read this stack overflow article “The Incredible Growth of Python“. Now, let’s continue with the Python use case example.

Python use case – Map unequal comma separated values from two columns

In this example, assume that we have a table named “tbl_EmpProjectManagerDetail” with 3 columns which are “EmpName“, “CommaSepProjectNames“, and “CommaSepManagerNames“. Below is the T-SQL script to create the table and to add some sample data in it.

CREATE TABLE dbo.tbl_EmpProjectManagerDetail
(	
	EmpName varchar(256),
	CommaSepProjectNames varchar(max),
	CommaSepManagerNames varchar(max),
)
GO

INSERT INTO dbo.tbl_EmpProjectManagerDetail
(EmpName, CommaSepProjectNames, CommaSepManagerNames)
VALUES
('John', 'ERP,HR Assistant,Admin Helper', 'Kevin,John,Alen'),
('David', 'HR Assistant,Admin Helper,Airline Management,Traffic Controller,Travel Solutions', 'John,Alen,James,Hedge'),
('Steve', 'Power Management,School Management,One touch Attendance system', 'Stuart,Finch,Broad,Clark')
GO

Let’s have a look at the table data which is as below:

Sample table data

Sample table data

The column EmpName contains the name of employees, CommaSepProjectNames column contains the comma-separated names of projects in an ordered way in which the employee has worked so far, and CommaSepManagerNames column contains the comma-separated names of the managers for these projects in an ordered way. However, the comma-separated list of projects and managers can be unequal, and in this case, the last values from any of these columns need to be mapped to “None” value in the output instead of discarding these values. The final output will have two columns as “EmpName“, and “ProjectManagerMapping“. The column “EmpName” will have the name of the employees and the column “ProjectManagerMapping” will have the list of projects and managers mapped with each other in the order they are recorded in the columns “CommaSepProjectNames” and “CommaSepManagerNames“. Also, in case of unequal length, we need to map the last values from any of these columns with None value. Below is the screenshot of the required output:

Required Output

Required Output

In above image, we can see that in the column “ProjectManagerMapping“, for the employee “David”, no manager has defined for the project “Travel Solutions” (Highlighted in red rectangle in the second row). Also, for the employee “Steve”, no project name has defined for manager “Clark” (Highlighted in red rectangle in the third row). Now, let’s talk about the solution which we are going to implement using Python script which will execute in-database using inbuilt system stored procedure “sp_execute_external_script”.

Map unequal comma separated values from two columns using Python Script in SQL Server 2017

Undoubtedly, we can achieve the above output using T-SQL code in a set based way. However, the purpose of this article is to explore an use case scenario of Python’s data transformation capability which can be used in SQL Server 2017. Also, we are not going to compare the performance of achieving this output using T-SQL or using an in-database Python script. If you want to go through the set based approach techniques in SQL Server, you can visit my video tutorial series which explains how to think in sets in SQL Server.

Let’s have a look at the code being used to achieve the above output using Python script in SQL Server:

EXEC sp_execute_external_script 
@language = N'Python',
@script = 
N'
tblData = InputDataSet #Reads the input data in a data frame received from SQL Server

from itertools import zip_longest
import pandas as pd

lst = []
for pair, emp in zip(zip(tblData.CommaSepProjectNames.str.split(","), tblData.CommaSepManagerNames.str.split(",")), tblData.EmpName):
	lst.append([emp, str(list(" : ".join(t) for t in (list(zip_longest(pair[0], pair[1], fillvalue="None"))))).lstrip("[").rstrip("]").replace("''", "")])

OutputDataSet = pd.DataFrame(lst)
',
@input_data_1 = N'SELECT EmpName, CommaSepProjectNames, CommaSepManagerNames FROM dbo.tbl_EmpProjectManagerDetail'
WITH RESULT SETS(("EmpName" varchar(256), "ProjectManagerMapping" varchar(max)))

How it is working

Let’s discuss the above code in more detail. In the above code, we are using stored procedure “sp_execute_external_script” to execute an external Python script in SQL Server. We are inputting the data from the table “tbl_EmpProjectManagerDetail” to the Python script using “@input_data_1” parameter of the procedure “sp_execute_external_script”. The @script parameter contains the actual Python code to transform the input data into the required output. Let’s talk about the Python code which we are using above.

Python provides a function named zip_longest which resides inside the itertools module and can be used to map values from two iterables considering the longest one. In our Python script, in the first line, we are assigning the input dataframe to a variable named tblData using below line:

tblData = InputDataSet

Then, we are importing the required modules using below lines of code:

from itertools import zip_longest
import pandas as pd

After that, we are creating an empty list because we know that a list is a mutable object which allows modifying the data in place. In this way, a list will help us to improve the performance of the execution of the code.

lst = []

Then, we are using a for loop (which can also be converted into a list comprehension which can outperform the for loop) to transform the data as below.

for pair, emp in zip(zip(tblData.CommaSepProjectNames.str.split(","), tblData.CommaSepManagerNames.str.split(",")), tblData.EmpName):
	lst.append([emp, str(list(" : ".join(t) for t in (list(zip_longest(pair[0], pair[1], fillvalue="None"))))).lstrip("[").rstrip("]").replace("''", "")])

In the above line of code, first, we are combining the values from both of the columns (CommaSepProjectNames, CommaSepManagerNames) using zip function as “zip(tblData.CommaSepProjectNames.str.split(“,”), tblData.CommaSepManagerNames.str.split(“,”))“. After that, we are adding employee names along with the concatenated values of these two columns using an outer zip function as “zip(zip(tblData.CommaSepProjectNames.str.split(“,”), tblData.CommaSepManagerNames.str.split(“,”)), tblData.EmpName)“. Then, we are assigning the output of the outer zip into two variables named “pair, and emp”. Next, we are using a for loop to iterate over for each row of the dataframe. Then inside the loop, we are using “lst.append([emp, str(list(” : “.join(t) for t in (list(zip_longest(pair[0], pair[1], fillvalue=”None”))))).lstrip(“[“).rstrip(“]”).replace(“””, “”)])” to map values from both the columns considering the longest with a filler value “None” for the unmapped values followed by a conversion of the elements of the list into string using join method with some cleanup activity, finally, these values gets appended into the list.

Now, let’s talk quickly about the performance summary of the above code.

Quick Performance Summary

Though I have not compared the performance of the above code with T-SQL set based code. But, in a quick testing, I found that this code is running quite faster than expected. Below is the summary of the quick test. Please note that the execution time is an average of multiple executions.

Performance summary

Performance summary

I will recommend you to perform extensive testing of the above code before moving it to the production environment.

Thanks for the reading. Please share your the input in the comment section of the post.

Rate This
[Total: 1    Average: 5/5]


Gopal Krishna Ranjan

About Gopal Krishna Ranjan

I am Gopal Krishna Ranjan, having 6+ years of industry experience in Software development using Microsoft technologies. I have a head down experience in Database development, performance tuning in SQL Server, T-SQL optimization, BI (Business Intelligence) project implementation, reporting in SSRS, using SSIS for ETL, implementing multi dimensional and tabular data-warehouse in SSAS, querying cubes using MDX and DAX, Windows and Web Applications development with C#.

Leave a comment

Your email address will not be published. Required fields are marked *

*