When we start working on a Machine Learning/Data Science project, we first need to clean and transform the data to get a tidy dataset. Then, we perform some EDA (Exploratory Data Analysis) to find trends, patterns, and outliers in the data. Once we have machine-interpretable data in place, we choose an algorithm and train the model, and then we evaluate it on the test data. Next, we can tune the hyperparameters of the model and retrain it to get a more robust model. Once the model performance is acceptable, we deploy it to make predictions. These are the steps we typically follow when creating a Machine Learning model.

In this post, “**Building first Machine Learning model using Logistic Regression in Python**”, we are going to create our first predictive machine learning model step by step. We will use the scikit-learn library and one of its standard datasets for demonstration. Let’s have a quick look at the dataset we are going to use.

### Wisconsin Breast Cancer dataset

The Wisconsin Breast Cancer dataset is a standard, preprocessed, and cleaned binary classification dataset that ships with the scikit-learn library. It contains 569 samples (212 malignant, 357 benign). Each sample has 30 features (independent variables). The target variable (dependent or output variable) contains the diagnosis: **0 (malignant)** or **1 (benign)**. We have to create a model that can predict whether a given unseen sample is malignant or benign.

The attribute information is given below:

Ten real-valued features are computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area – 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” – 1)

The **mean**, **standard error**, and **worst** (largest) value of each of the above ten attributes are computed for each image, which gives 30 features in total.
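To see how the ten attributes turn into 30 columns, we can group the feature names by position. This is a minimal sketch; the variable names `mean_cols`, `error_cols`, and `worst_cols` are our own and are not part of scikit-learn.

```python
from sklearn.datasets import load_breast_cancer

feature_names = list(load_breast_cancer().feature_names)

# The 30 columns come in three consecutive blocks of ten:
# means, standard errors, and worst values.
mean_cols = feature_names[0:10]
error_cols = feature_names[10:20]
worst_cols = feature_names[20:30]

print(mean_cols[:3])
print(error_cols[:3])
print(worst_cols[:3])
```

The same three blocks of ten columns are used later in the post when the histograms are plotted.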


### Data preprocessing and Exploration

Let’s load this dataset and analyze it in Python using a pandas DataFrame.

    from sklearn import datasets  # import the datasets module from scikit-learn
    import pandas as pd           # import pandas under the alias pd

    # load the breast cancer dataset into a variable named data
    data = datasets.load_breast_cancer()

The variable named “data” is of type **<class 'sklearn.utils.Bunch'>**, which is a dictionary-like object. The five keys/properties we use in this post are:

- **DESCR** – Displays the description of the dataset
- **data** – Contains the input feature data in a NumPy array with shape (569, 30)
- **feature_names** – Contains the names of the features
- **target** – Contains the target values (dependent variable values) for each of the 569 rows, with shape (569,)
- **target_names** – Contains the names of the target classes

We can access these properties using syntax like:

**data.<property_name>** or **data['<property_name>']**.
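As a quick illustration of both access styles, here is a small sketch that reads a few of these properties. It only assumes the dataset has been loaded as in the code above.

```python
from sklearn import datasets

data = datasets.load_breast_cancer()

# Attribute-style and dictionary-style access return the same values
print(data.target_names)          # names of the two classes
print(data["target_names"])       # same array, dictionary-style

print(data.data.shape)            # (number of samples, number of features)
print(data["feature_names"][:3])  # first three feature names
```

Both styles work because a Bunch behaves like a dictionary whose keys are also exposed as attributes.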

We can see that the input feature values and the target variable values are stored separately. This is because scikit-learn estimators expect the input features and the target variable in two different arrays.
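If we only need the two arrays, `load_breast_cancer` can also return them directly through its `return_X_y` parameter. A minimal sketch (the names `X` and `y` are just a common convention, not something this post uses elsewhere):

```python
from sklearn.datasets import load_breast_cancer

# return_X_y=True returns a (features, target) tuple instead of a Bunch
X, y = load_breast_cancer(return_X_y=True)

print(X.shape)  # one row per sample, one column per feature
print(y.shape)  # one label per sample
```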

A pandas DataFrame is a very powerful and handy tool for data analysis. It has many built-in methods and properties that make the data analysis process very smooth. So, let’s create a dataframe from this data to have a quick look at the feature values of this dataset.

    # create a dataframe df with the feature names as column names
    df = pd.DataFrame(data.data, columns=data.feature_names)
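The dataframe above holds only the features. To look at the target values as well, one option is to attach them to a copy of the dataframe and count the classes. This is a sketch; `df_labeled` and the `diagnosis` column are hypothetical names introduced here for illustration.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Work on a copy so the feature-only dataframe df stays unchanged
df_labeled = df.copy()
df_labeled["target"] = data.target

# Map 0/1 to the class names stored in target_names
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
df_labeled["diagnosis"] = df_labeled["target"].map(label_names)

class_counts = df_labeled["diagnosis"].value_counts()
print(class_counts)
```

The counts should match the 212 malignant and 357 benign samples described earlier.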

To preview the first few rows of the features:

    print(df.head())  # print the top 5 rows of the dataframe

**Output:**

Let’s generate a quick overview of the column names, their data types, non-null value counts, and memory usage using the **.info()** method. We can also generate a quick statistical summary of the input columns using the **.describe()** method.

    # print column names, data types, and non-null counts for each column
    # (info() prints its report directly and returns None, so it is not wrapped in print)
    df.info()

    # print a statistical summary of the columns
    print(df.describe())

**Output:**

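We can also use the **.isnull()** or **.isna()** method to check for missing (null/NaN) values in this dataset. In pandas, the two methods are aliases and return the same result:

    print(df.isnull().sum())  # number of missing values per column
    print(df.isna().sum())    # same result, isna() is an alias of isnull()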

**Output:**

We can see that all the features in this dataset are numeric, which is required in order to use them in a Machine Learning model. Also, we don’t have any null or NaN values in this dataset. So, we can say that this dataset satisfies the tidy data principles and can be used in a Machine Learning model. However, before fitting this data into the model, let’s do some EDA on it.
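Instead of reading through the printed output, we can also verify both observations programmatically. This is a small sketch; `all_numeric` and `total_missing` are names chosen here for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# True only if every column has a numeric dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in df.dtypes)

# Sum the per-column missing counts into one number
total_missing = int(df.isna().sum().sum())

print("All columns numeric:", all_numeric)
print("Total missing values:", total_missing)
```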

### EDA (Exploratory Data Analysis)

Let’s plot a histogram for each feature. We can use this script:

    # import pyplot from the matplotlib library
    import matplotlib.pyplot as plt

    # Draw histograms for every feature, one figure per group of ten columns
    def draw_hist_all(df):
        # Columns 0-9 are the means, 10-19 the standard errors, 20-29 the worst values
        for start in (0, 10, 20):
            group = df.iloc[:, start:start + 10]
            group.hist(xlabelsize=8, ylabelsize=8, bins=4, figsize=(6, 4))
            plt.tight_layout()
        plt.show()

    draw_hist_all(df)

**Output:**

First, we split the main dataframe into three dataframes of 10 columns each (mean, standard error, and worst values). Then, we used the **.hist()** method of each dataframe to plot a histogram of every feature.

We can also apply more EDA techniques to this dataframe before fitting the data into our Machine Learning model.
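As one example of such a technique, we can look at how strongly some features are correlated. Radius, perimeter, and area all describe the size of the nucleus, so we would expect them to move together. A minimal sketch using the pandas `.corr()` method (the choice of these three columns is ours):

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Pearson correlation between three size-related features
size_cols = ["mean radius", "mean perimeter", "mean area"]
corr = df[size_cols].corr()

print(corr.round(3))
```

Strongly correlated features carry overlapping information, which is one of the things feature engineering can address.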

We can also do some feature engineering before fitting this data into a Machine Learning model. However, in this post, we are not going to demonstrate feature engineering.

### What is Logistic Regression?

In spite of its name, Logistic Regression is used for classification problems and not for regression problems. In its basic (binary) form, it models the probability that a dependent variable takes one of two possible outcomes, such as True/False, Pass/Fail, healthy/sick, dead/alive, or 0/1.
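Under the hood, binary Logistic Regression passes a linear combination of the features through the sigmoid (logistic) function to get a probability, and then applies a threshold to pick a class. Here is a minimal sketch of that idea with NumPy; the scores in `z` are made-up values for illustration, and the 0.5 threshold is the usual default.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w . x + b) for three samples
z = np.array([-2.0, 0.0, 2.0])

probabilities = sigmoid(z)
predicted_class = (probabilities >= 0.5).astype(int)

print(probabilities)
print(predicted_class)
```

A score of 0 maps to a probability of exactly 0.5, negative scores map below it, and positive scores map above it.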

**Types of Logistic Regression**

- **Binary Logistic Regression:** The target variable has two possible outcomes only.
- **Multinomial Logistic Regression:** The target variable has three or more classes without ordering.
- **Ordinal Logistic Regression:** The target variable has three or more categories with ordering.


### Building Logistic Regression – First Classification model using scikit-learn

First, we need to split the given data into two parts: a **training dataset** and a **test dataset**. We will use the training dataset to train the model and the test dataset to evaluate its performance. It is always recommended to keep some unseen data aside to evaluate model performance. This is the code to train the model and make predictions on the test dataset (use it in addition to the code above).

    # import required modules
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Assign the feature and target data to two different variables
    x_data = data.data
    y_data = data.target

    # Split the input dataset into two parts: training dataset and test dataset
    x_train, x_test, y_train, y_test = train_test_split(
        x_data, y_data, stratify=y_data, test_size=0.3, random_state=21)

    # Instantiate a logistic regression model
    logreg = LogisticRegression()

    # The fit method trains the model on the training dataset
    logreg.fit(x_train, y_train)

    # The predict method predicts the outcome on unseen data
    y_pred = logreg.predict(x_test)
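The features in this dataset have very different scales (area values are in the hundreds, while smoothness values are below 1). Depending on the scikit-learn version, the default solver may print a convergence warning on unscaled data like this. One common way to handle it, which is our suggestion and not part of the code above, is to standardize the features in a pipeline before the model. A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=21)

# The scaler is fitted on the training data only, then applied to the test data
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_train, y_train)

test_accuracy = scaled_logreg.score(X_test, y_test)
print(f"Test accuracy with scaling: {test_accuracy:.3f}")
```

Putting the scaler inside the pipeline keeps the test data out of the scaling statistics, so the evaluation still uses truly unseen data. The exact accuracy can differ from the numbers reported below, since this is a different model setup.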

### Accuracy check using Confusion matrix

Now, it’s time to check the accuracy of our classification model. We can use different metrics to check the accuracy of our model, such as the **confusion matrix, classification report, accuracy score, and roc_auc_score**.

Let’s check using a few of the above-mentioned metrics:

    from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

    # Print the confusion matrix
    print('Confusion matrix is as:')
    print(confusion_matrix(y_test, y_pred))

**Output:**

As we have a binary classifier here, the dimension of this matrix is 2 x 2. We have two classes, 0 and 1. The diagonal values of the matrix represent the correct predictions and the off-diagonal values represent the incorrect predictions. We can also use a dataframe to print this confusion matrix in a more readable form:

    print(pd.DataFrame(confusion_matrix(y_test, y_pred),
                       index=['Actual 0', 'Actual 1'],
                       columns=['Predicted 0', 'Predicted 1']))

**Output:**

              Predicted 0  Predicted 1
    Actual 0           58            6
    Actual 1            3          104

Out of the 64 samples whose actual class is 0 (row 1), the model predicted 58 as 0 and 6 as 1. Out of the 107 samples whose actual class is 1 (row 2), the model predicted 3 as 0 and 104 as 1.
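From these four counts we can work out the usual metrics by hand. The sketch below uses the counts quoted in this post directly, so it does not depend on the model. We treat class 0 (malignant) as the class of interest here, which is our choice for illustration.

```python
# Counts taken from the confusion matrix described in this post
# (rows = actual class, columns = predicted class)
actual0_pred0, actual0_pred1 = 58, 6
actual1_pred0, actual1_pred1 = 3, 104

total = actual0_pred0 + actual0_pred1 + actual1_pred0 + actual1_pred1

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = (actual0_pred0 + actual1_pred1) / total

# Recall for class 0: share of actual 0 samples that were predicted as 0
recall_0 = actual0_pred0 / (actual0_pred0 + actual0_pred1)

# Precision for class 0: share of predicted 0 samples that are actually 0
precision_0 = actual0_pred0 / (actual0_pred0 + actual1_pred0)

print(f"accuracy    = {accuracy:.3f}")
print(f"recall_0    = {recall_0:.3f}")
print(f"precision_0 = {precision_0:.3f}")
```

For a medical dataset like this one, recall on the malignant class is often watched more closely than overall accuracy, because a missed malignant case is usually the more costly error.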

Let’s print the classification report and the accuracy score of the model:

    print('Classification report is as:')
    print(classification_report(y_test, y_pred))

    acc = accuracy_score(y_test, y_pred)
    print('Accuracy of model is {0}'.format(acc))

**Output:**

We can see that the accuracy of this classification model is approximately 94.7% (162 correct predictions out of 171), which is an acceptable score. However, we can further improve the accuracy of this model using feature engineering and other techniques.
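We listed roc_auc_score among the metrics but did not use it. Unlike accuracy, it is computed from the predicted probabilities rather than the predicted labels. Here is a minimal sketch; raising `max_iter` is our assumption to give the solver more room to converge on unscaled features, so the fitted model may differ slightly from the one above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=21)

# max_iter is raised from its default (an assumption made for this sketch)
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated probability of class 1 (benign)
prob_benign = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, prob_benign)
print(f"ROC AUC: {auc:.3f}")
```

A value close to 1 means the model ranks benign samples above malignant ones almost all of the time, regardless of the threshold used to turn probabilities into labels.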

Thanks for reading. Please share your inputs in the comments.