In this post “Interactive Data Analysis with SQL Server using Jupyter Notebooks”, we will demonstrate how to use Jupyter Notebooks for interactive data analysis with SQL Server. Jupyter Notebook is one of the most useful tools for any data scientist or data analyst. It supports 40+ programming languages and provides a web-based interactive programming environment. We can put comments, headings, code, and output in one single document. The document keeps its connection to the original data source, which means we can re-execute the code whenever we need to. This lets data scientists and data analysts modify and re-run the code during presentations. Notebooks are also easy to share across teams.
In this post, we are going to discuss how to export SQL Server table data to an Excel file or a CSV file using Python’s pandas library. Prior to SQL Server 2017, we could use one of the following methods to export data from SQL Server to an Excel or CSV file:
- Create an SSIS package to export the data from SQL Server – This option is a good choice if we want to reuse the export process again and again. It is also the better choice if the export needs moderate or complex transformations.
- Use the SQL Server Import and Export Wizard – SQL Server provides a built-in import/export wizard that can be used when we want to export data with minimal or no transformations.
- OS-based copy-paste functionality – We can simply copy the query output from SQL Server using Ctrl + C option and then open
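As a rough sketch of the pandas-based export this post covers, the snippet below reads a query result into a DataFrame and writes it out as CSV. It is an illustration under assumptions, not the post's own code: an in-memory SQLite database stands in for SQL Server so the example is self-contained, and the `Employee` table and its rows are made up.

```python
import io
import sqlite3

import pandas as pd

# Assumption: SQLite stands in for SQL Server so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Id INTEGER, Name TEXT)")  # hypothetical table
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])

# Read the query result into a DataFrame.
df = pd.read_sql_query("SELECT Id, Name FROM Employee ORDER BY Id", conn)
conn.close()

# Write the DataFrame as CSV; a file path could be passed instead of a buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

For an Excel target, `df.to_excel(...)` plays the same role as `to_csv`, provided an Excel writer engine such as openpyxl is installed.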
Apache Spark is a general-purpose big data processing engine. It is a very powerful cluster computing framework that can scale from a single machine to clusters of thousands of nodes. It can run on clusters managed by Hadoop YARN, Apache Mesos, or Spark’s own standalone cluster manager. To read more about the Spark big data processing framework, visit the post “Big Data processing using Apache Spark – Introduction”. In this post, we will learn how to install Apache Spark on a local Windows machine in pseudo-distributed mode (managed by Spark’s standalone cluster manager) and run it using PySpark (Spark’s Python API).
Install Spark on Local Windows Machine
To install Apache Spark on a local Windows machine, we need to follow the steps below:
Step 1 – Download and install Java JDK 8
Java JDK 8 is required as a prerequisite for the Apache Spark installation. We … More
Once we have installed Jupyter Notebook, we can start it by executing the “jupyter notebook” command in the command prompt on a Windows machine or in the terminal on a Mac. Jupyter Notebook is a very useful web-based application for writing programs in many programming languages, such as Python, R, Scala, and Julia. Notebooks created in Jupyter can be shared easily with other users over email, Git, and Dropbox. We can use Jupyter notebooks to write code interactively, which is very handy for re-running individual snippets. Jupyter also integrates well with the Big Data ecosystem and with cloud platforms.
When we start the jupyter notebook server, it shows the notebooks from the current working directory from which the notebook server is started. That is why the default working directory of a Jupyter notebook server is … More
In this post “Python use case – Save each worksheet as a separate excel workbook”, we are going to learn how to create a separate workbook for each worksheet of a given Excel file. We will copy the data, values, formatting, and all other settings of each sheet into the newly created workbook. Assume that we have an Excel workbook with five sheets named “Sheet1”, “Sheet2”, “Sheet3”, “Sheet4”, and “Sheet5”.
Now, we want to create five separate Excel workbooks (one for each worksheet).
We are going to use the Excel application from a Python script to copy the data, values, formatting, and other settings of each sheet into a new workbook. Below is the Python code.
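The original listing is not part of this excerpt, so what follows is only a minimal sketch of the approach described above. It assumes the pywin32 package (`win32com`) on a Windows machine with Excel installed. The function names and the `<workbook>_<sheet>.xlsx` naming scheme are hypothetical choices for illustration.

```python
from pathlib import Path


def output_path(source: Path, sheet_name: str) -> Path:
    """Build '<workbook stem>_<sheet name>.xlsx' next to the source workbook."""
    return source.with_name(f"{source.stem}_{sheet_name}.xlsx")


def split_workbook(source: Path) -> list:
    """Save every worksheet of `source` as its own workbook via Excel itself."""
    # Assumption: pywin32 is installed; imported here so the helper above
    # stays usable on machines without Excel.
    import win32com.client

    source = source.resolve()
    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = False
    excel.DisplayAlerts = False  # suppress overwrite prompts
    saved = []
    try:
        workbook = excel.Workbooks.Open(str(source))
        for sheet in workbook.Worksheets:
            # Copy() without arguments places the sheet, with its formatting,
            # in a brand-new workbook that becomes the active workbook.
            sheet.Copy()
            new_book = excel.ActiveWorkbook
            target = output_path(source, sheet.Name)
            new_book.SaveAs(str(target), 51)  # 51 = xlOpenXMLWorkbook (.xlsx)
            new_book.Close(False)
            saved.append(target)
        workbook.Close(False)
    finally:
        excel.Quit()
    return saved
```

Calling `split_workbook(Path("report.xlsx"))` on a five-sheet workbook would be expected to produce files such as `report_Sheet1.xlsx` through `report_Sheet5.xlsx` in the same folder. Letting Excel perform the copy is what carries formatting and sheet settings across, which a plain data read would lose.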
To execute the above Python script, we can run the .py file from a command prompt window like this.… More
In the previous post, we created our first Machine Learning model using Logistic Regression to solve a classification problem. We used the “Wisconsin Breast Cancer dataset” for demonstration purposes. Now, in this post “Building Decision Tree model in python from scratch – Step by step”, we will be using the IRIS dataset, a standard dataset that comes with the scikit-learn library. Let’s have a quick look at the IRIS dataset.
The IRIS dataset
The IRIS dataset is a multi-class classification dataset introduced by the British statistician and biologist Ronald Fisher in 1936. It has 150 observations: 50 samples of each of three species of Iris flower, “setosa”, “versicolor”, and “virginica”. It is a standard, cleansed, and preprocessed multivariate dataset that comes preloaded with the scikit-learn library. Each sample has four input features, which are:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)
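A quick way to confirm the figures above is to load the dataset with scikit-learn's `load_iris` and inspect it. This is a small sketch, assuming scikit-learn is installed. The variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris

iris = load_iris()

# 150 observations, four input features each.
print(iris.data.shape)  # (150, 4)
print(iris.feature_names)

# Map the numeric targets to species names and count the samples per species.
species_counts = Counter(str(name) for name in iris.target_names[iris.target])
print(species_counts)
```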
When we start working on a Machine Learning/Data Science project, we first need to perform some data cleaning and data transformation to get a tidy dataset. Then, we need to perform some EDA (Exploratory Data Analysis) to find trends, patterns, and outliers in the given data. Once we have machine-interpretable data in place, we choose an algorithm and train the model. Then, we evaluate it on the test data. Next, we can tune the hyperparameters of the model and retrain it to get a robust model. Once the model performance is acceptable, we deploy it to make predictions. Typically, we follow these steps when creating a Machine Learning model:
In this post “Building first Machine Learning model using Logistic Regression in Python”, we are going to create our first machine learning predictive model step by step. We will be using scikit-learn library … More
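The train-and-evaluate part of that workflow can be sketched in a few lines. This is a sketch under assumptions rather than the post's own code: it uses scikit-learn's bundled Wisconsin Breast Cancer dataset, and the 75/25 split, the feature scaling step, and the `random_state` value are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the features and the binary target.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set so the evaluation uses data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on the held-out test data.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Hyperparameter tuning and deployment, the later steps of the workflow, are left out of this sketch.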
In the previous post, “Tidy Data in Python – First Step in Data Science and Machine Learning”, we discussed the importance of tidy data and its principles. In a Machine Learning project, once we have a tidy dataset in place, it is always recommended to perform EDA (Exploratory Data Analysis) on the underlying data before fitting a Machine Learning model to it. Let’s start by understanding the importance of EDA and some basic EDA techniques that are very useful.
What is Exploratory Data Analysis (EDA)
Exploratory Data Analysis, or EDA, is the process of organizing, plotting, and summarizing the data to find trends, patterns, and outliers using statistical and visual methods. It takes input data in a tabular format and represents it graphically, which makes it easier for humans to interpret. It is an important step in a Machine Learning/Data Science project which should be performed before … More
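To make the statistical side of EDA concrete, here is a small sketch using only Python's standard library. It summarizes one numeric column and flags outliers with the common 1.5 × IQR rule. That rule is our choice for illustration and is not taken from the post, and the sample values are made up.

```python
import statistics


def summarize(values):
    """Return basic summary statistics and IQR-based outliers for a column."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }


# Made-up measurements with one suspicious reading.
readings = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 12.0]
summary = summarize(readings)
print(summary["outliers"])  # [12.0]
```

In practice the same summary usually comes from a DataFrame's `describe()` method, and plots such as histograms and box plots cover the visual side.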
Resampling time series data in SQL Server using Python’s pandas library
In this post, we are going to learn how to use the power of Python in SQL Server 2017 to resample time series data using Python’s pandas library. Sometimes, we get the sample data (observations) at a different frequency (higher or lower) than the required one. In such scenarios, we need to convert the frequency of the given samples to the frequency of the required outcome. Changing the frequency of time series data with a T-SQL query is tedious, especially for upsampling, where we need to generate more rows than the sample dataset contains. Python’s pandas module has built-in capabilities for frequency conversion. With the help of the pandas resample method, we can increase or decrease the time series observation frequencies with only few lines of … More
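The idea can be sketched with plain pandas, outside SQL Server. The example below downsamples a made-up daily series to two-day totals and upsamples it to 12-hour intervals. The data, the frequencies, and the choice of `sum` and forward fill are all assumptions made for illustration.

```python
import pandas as pd

# Four made-up daily observations.
index = pd.date_range("2024-01-01", periods=4, freq="D")
daily = pd.Series([10.0, 20.0, 30.0, 40.0], index=index)

# Downsampling: fewer rows, so each bin needs an aggregate (here, the sum).
two_day_totals = daily.resample("2D").sum()

# Upsampling: more rows than the input, so the new rows need a fill rule
# (here, carry the last known value forward).
half_day = daily.resample("12h").ffill()

print(two_day_totals)
print(half_day)
```

Upsampling is the case the post calls tedious in T-SQL: the four input rows become seven output rows, and pandas generates the extra timestamps itself.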
If we need to import data from an Excel file into SQL Server, we can use these methods:
- SQL Server Import Export Wizard
- Create an SSIS package to read the Excel file and load the data into a SQL Server table
- Use T-SQL OPENROWSET query
- Use the read_excel method of Python’s pandas library (Only available in SQL Server 2017 onwards)
In this post “Python use case – Import data from excel to sql server table – SQL Server 2017”, we are going to learn how to use the power of Python in SQL Server 2017 to read a given Excel file directly into a SQL Server table. With the integration of Python in SQL Server 2017, we can use the pandas read_excel method, which offers many customization options, to read a given Excel file in SQL Server.