Import zipped CSV file without unzipping it in SSIS using SQL Server 2017

SQL Server Integration Services (SSIS) is one of the most popular ETL tools. It has many built-in components that can be used to automate enterprise ETL (Extract, Transform, and Load) processes. And if we need a customized component that is not available in SSIS, we can create one by writing our own code in C# using a Script Task or Script Component.

In this post, we are going to explore how we can read and load a zipped CSV file into SQL Server without unzipping it, using SSIS along with SQL Server 2017. Reading a zipped file directly (without unzipping it) saves the time needed to write the decompressed text file to the physical disk and then read it back from there. As of now, SSIS has no built-in component that reads a zipped file without unzipping it. For each file, the typical steps to read and load a zipped file into a SQL Server table using SSIS are:

  1. Unzip the compressed file – Decompress the zipped file using an Execute Process Task (with a command-line utility) at a given location
  2. Read the raw CSV data – Read the decompressed file with the defined schema
  3. Apply transformations – Apply the required transformations on the data
  4. Load the data – Load the data into a SQL table
  5. Perform cleanup – Delete the decompressed file from the given location
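The five steps above can be sketched in plain Python (outside SSIS) to make the extra disk round trip explicit. This is only an illustrative sketch: the sample file is generated on the spot and the transform/load steps are left as no-ops.

```python
import bz2
import csv
import os
import tempfile

def load_via_unzip(zipped_path, work_dir):
    # Step 1: unzip - decompress the bz2 file to the physical disk
    raw_path = os.path.join(work_dir, "decompressed.csv")
    with bz2.open(zipped_path, "rt") as src, open(raw_path, "w", newline="") as dst:
        dst.write(src.read())
    # Step 2: read the raw CSV data with its header-defined schema
    with open(raw_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Steps 3 and 4: transformations and the SQL load would go here (no-op in this sketch)
    # Step 5: cleanup - delete the decompressed file
    os.remove(raw_path)
    return rows

# Generate a tiny bz2-compressed sample so the sketch is self-contained.
work_dir = tempfile.mkdtemp()
sample = os.path.join(work_dir, "sample.csv.bz2")
with bz2.open(sample, "wt") as f:
    f.write("PersonID,FullName\n1,John Doe\n2,Jane Doe\n")

rows = load_via_unzip(sample, work_dir)
print(len(rows))
```

Note how the decompressed copy exists on disk only between steps 1 and 5; that round trip is exactly what reading the compressed file directly avoids.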

Now, if the zipped file is large and highly compressed, unzipping it takes a long time because the process performs on-disk I/O operations, which are costly. Also, the first and last steps above are not really part of the main ETL process: we implement the unzip step only to get the original raw file, and the cleanup step only to reclaim disk space. If we could read the zipped files on the fly, we could skip the unzip and cleanup steps entirely.

With the integration of Python in SQL Server 2017, we can read a zipped file very easily using the pandas module. We can use the pandas read_csv function to import an external zipped CSV file into SQL Server 2017 through the sp_execute_external_script system stored procedure. read_csv can read various compression formats on the fly, for example ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’. If you want to read how we can import a CSV file directly into SQL Server using T-SQL, click here.
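As a quick local check of this capability, independent of SQL Server, the snippet below writes a tiny bz2-compressed CSV and reads it back with read_csv without ever materializing an uncompressed file. The file name is illustrative.

```python
import bz2
import os
import tempfile

import pandas as pd

# Write a small bz2-compressed CSV file.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "people.csv.bz2")
with bz2.open(path, "wt") as f:
    f.write("PersonID,FullName\n1,John Doe\n2,Jane Doe\n")

# compression can be given explicitly ("bz2", "gzip", "zip", "xz") or left
# at the default "infer", which picks it up from the file extension.
df = pd.read_csv(path, compression="bz2")
print(df.shape)
```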

Assume that we have a bzip2-compressed CSV file which we need to load into a SQL table. This is the sample CSV file. You can download the sample zipped file from here (extract the zipped file to get the bz2-compressed file).

Sample zipped CSV file

To keep this demo simple, we are not going to apply any transformation to the raw data and will load it as is into a SQL table. To create an SSIS package that can load the zipped file on the fly, we need a SQL Server 2017 instance up and running. Once we have SQL Server 2017 installed, we can follow these steps:

  • Create a new Integration Services project in Visual Studio and give the solution a name. In our case, we have named it “LoadZippedFileOnFly“.
  • Add a new package and give it a name. In this demo, we have named it “ReadZippedFile.dtsx“.
  • Drag and drop a “Data Flow Task” onto the package and rename it, for example, “DFT – Read and Load zipped file on-fly“.
  • Double-click the DFT to open it.
  • Create a new OLEDB connection “ODB – SQLServer2017” which uses the SQL Server 2017 instance as the server in “DFT – Read and Load zipped file on-fly“. We can use any database here, but remember that the database must be configured to allow the execution of external scripts.
  • Add a new OLEDB source component “ODBS – Read zipped file” to the DFT using the OLEDB connection “ODB – SQLServer2017“.
  • In the Data access mode drop-down, select “SQL command”.
  • Add this script as the source query in SQL command text:
  • EXEC sp_execute_external_script
    @language = N'Python',
    @script =
    N'
    #Line 1
    import pandas as pd
    
    #Line 2
    OutputDataSet = pd.read_csv("D:\Sample CSV File for pandas read_csv.txt.bz2", names = ["PersonID", "FullName", "PreferredName", "SearchName", "IsPermittedToLogon", "Age", "AddedOn"], parse_dates = ["AddedOn"], header = 0, compression = "bz2")
    '
    WITH RESULT SETS
    (
    	(
    		PersonID INT,
    		FullName VARCHAR(512),
    		PreferredName VARCHAR(512),
    		SearchName VARCHAR(512),
    		IsPermittedToLogon BIT,
    		Age INT,
    		AddedOn DATE
    	)
    )
    
  • OLEDB source component

  • Add a new OLEDB destination component “ODBD – Load into SQL Table” with the OLEDB connection “ODB – SQLServer2017” to the DFT.
  • In Data access mode, select “Table or view – fast load”. From the “Name of the table or the view” drop-down, choose an existing table with a similar schema or create a new one by clicking the New button. We have used this script to create a sample table “tbl_sample_csv“:
  • IF OBJECT_ID('dbo.tbl_sample_csv') IS NOT NULL
    	DROP TABLE dbo.tbl_sample_csv
    GO
    --Create table now
    CREATE TABLE tbl_sample_csv 
    (
        PersonID int,
        FullName varchar(512),
        PreferredName varchar(512),
        SearchName varchar(512),
        IsPermittedToLogon bit,
        Age int,
        AddedOn date
    )
    
  • Click on the Mapping tab and map the columns accordingly.
  • Execute the package.
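Before executing the package, the Python body used in the OLEDB source query can be tried standalone as a minimal sketch, with a generated sample file standing in for the real path, to confirm that the resulting data frame matches the declared result set:

```python
import bz2
import os
import tempfile

import pandas as pd

# Generate a small bz2-compressed CSV standing in for the real source file.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "sample.txt.bz2")
with bz2.open(path, "wt") as f:
    f.write("id,name,pref,search,logon,age,added\n"
            "1,John Doe,John,John Doe,1,32,2017-01-05\n")

# Same shape as the @script body: rename the columns, consume the file's own
# header row, parse AddedOn as a date, and read the bz2 stream directly.
OutputDataSet = pd.read_csv(
    path,
    names=["PersonID", "FullName", "PreferredName", "SearchName",
           "IsPermittedToLogon", "Age", "AddedOn"],
    parse_dates=["AddedOn"],
    header=0,
    compression="bz2",
)
print(OutputDataSet.dtypes)
```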

Once we execute the package, it should complete successfully and the data should be loaded into the SQL Server table.

Package execution output

Let’s have a look at the SQL table dbo.tbl_sample_csv.

tbl_sample_csv Data

Using the power of Python in SQL Server 2017, we can read zipped text files on the fly and hence improve package performance as well. You can download the sample SSIS project from here.
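To get a rough feel for the saving, the two approaches can be timed locally. This is only a measurement sketch on a generated file; the actual numbers depend entirely on file size, compression ratio, and disk speed.

```python
import bz2
import os
import tempfile
import time

import pandas as pd

# Build a modest bz2-compressed CSV to time against.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "big.csv.bz2")
with bz2.open(path, "wt") as f:
    f.write("PersonID,FullName\n")
    for i in range(50000):
        f.write(f"{i},Person {i}\n")

# Approach 1: decompress to disk first, then read (the classic flow).
t0 = time.perf_counter()
raw = os.path.join(tmp, "big.csv")
with bz2.open(path, "rt") as src, open(raw, "w") as dst:
    dst.write(src.read())
df1 = pd.read_csv(raw)
os.remove(raw)
t_disk = time.perf_counter() - t0

# Approach 2: read the compressed file directly, on the fly.
t0 = time.perf_counter()
df2 = pd.read_csv(path, compression="bz2")
t_direct = time.perf_counter() - t0

print(f"via disk: {t_disk:.3f}s, direct: {t_direct:.3f}s")
```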

Import zipped CSV file on-fly in SQL Server 2017 directly

Now, if we want to read a zipped file without unzipping it and load it into a SQL table from SQL Server itself, we can use this script in SQL Server 2017:

--Create sample table
IF OBJECT_ID('dbo.tbl_sample_csv') IS NOT NULL
	DROP TABLE dbo.tbl_sample_csv
GO
CREATE TABLE tbl_sample_csv 
(
    PersonID int,
    FullName varchar(512),
    PreferredName varchar(512),
    SearchName varchar(512),
    IsPermittedToLogon bit,
    Age int,
    AddedOn date
)
GO

--Read the zipped file on-fly and load the data
INSERT INTO dbo.tbl_sample_csv
EXEC sp_execute_external_script
@language = N'Python',
@script =
N'
#Line 1
import pandas as pd

#Line 2
OutputDataSet = pd.read_csv("D:\Sample CSV File for pandas read_csv.txt.bz2", names = ["PersonID", "FullName", "PreferredName", "SearchName", "IsPermittedToLogon", "Age", "AddedOn"], parse_dates = ["AddedOn"], header = 0, compression = "bz2")
'
GO

--Verify the table data
SELECT * FROM dbo.tbl_sample_csv
GO

How it works

Let’s walk through the Python code line by line.

#Line 1

import pandas as pd

In this line, we are just importing the pandas module under the alias pd.

#Line 2

OutputDataSet = pd.read_csv("D:\Sample CSV File for pandas read_csv.txt.bz2", names = ["PersonID", "FullName", "PreferredName", "SearchName", "IsPermittedToLogon", "Age", "AddedOn"], parse_dates = ["AddedOn"], header = 0, compression = "bz2")

In this line, we are using the pandas read_csv method to read the compressed CSV file on the fly. Here are the details of the parameters we are using:

  1. “D:\Sample CSV File for pandas read_csv.txt.bz2” – The path of the compressed source file.
  2. names = [“PersonID”, “FullName”, “PreferredName”, “SearchName”, “IsPermittedToLogon”, “Age”, “AddedOn”] – The column names for the pandas data frame, passed as a Python list.
  3. parse_dates = [“AddedOn”] – Tells read_csv to parse the AddedOn column as a date; read_csv does not accept a date type through its dtype parameter, so parse_dates is the supported way to get a date column.
  4. header = 0 – Marks the first row of the file as the header row, which is then replaced by the names list above.
  5. compression = “bz2” – The compression type of the zipped file.
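The interplay between header and names can be checked quickly on an in-memory CSV (a hypothetical two-column sample):

```python
import io

import pandas as pd

# header=0 says the file's first line is a header row; combined with
# names=[...], that row is consumed and the supplied names are used
# for the columns instead of the ones in the file.
csv_text = "id,name\n1,John Doe\n2,Jane Doe\n"
df = pd.read_csv(io.StringIO(csv_text), names=["PersonID", "FullName"], header=0)
print(list(df.columns), len(df))
```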

Thanks for reading. Please share your inputs in the comments.


About Gopal Krishna Ranjan

Gopal has 8 years of industry experience in Software development. He has a head down experience in Data Science, Database, Data Warehouse, Big Data and cloud technologies and has implemented end to end solutions. He has extensively worked on SQL Server, Python, Hadoop, Hive, Spark, Azure, Machine Learning, and MSBI (SSAS, SSIS, and SSRS). He also has good experience in windows and web application development using ASP.Net and C#.
