Creating a Wheel File in Python: Simplifying Package Distribution

In the Python ecosystem, package distribution plays a crucial role in sharing and reusing code efficiently. While pip, Python's standard package manager, lets us install packages effortlessly, sometimes we need to distribute our own Python packages. In such cases, wheel files prove to be a valuable asset. A wheel file is a built distribution format that can be installed directly, without a build step.
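As a quick illustration (a minimal sketch using setuptools; the package name "mypackage" and version are hypothetical), a wheel can be built like this:

```python
# setup.py - a minimal packaging script (package name and version are placeholders)
from setuptools import setup, find_packages

setup(
    name="mypackage",
    version="0.1.0",
    packages=find_packages(),
)

# Build the wheel from the project root (output lands in the dist/ folder):
#   pip install wheel
#   python setup.py bdist_wheel
```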

Create requirements.txt file in Python automatically

In this post, we will learn how to create a requirements.txt file for a Python project automatically. The requirements.txt file lists all the packages needed to run the project, which is very helpful, especially during deployment. Using the requirements.txt file, we can automate the deployment of the project to a different environment.
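One common approach (a sketch, assuming the required packages are already installed in the current environment) is to capture the environment with pip freeze:

```python
# Write the currently installed packages and their pinned versions to requirements.txt.
# Equivalent to running "pip freeze > requirements.txt" in a shell.
import subprocess
import sys

with open("requirements.txt", "w") as f:
    subprocess.run([sys.executable, "-m", "pip", "freeze"], stdout=f, check=True)

# Later, the same environment can be recreated with:
#   pip install -r requirements.txt
```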

PII Data Identification using Presidio Open Source ML Library

In today’s digital age, organizations deal with large amounts of sensitive data, including personally identifiable information (PII) such as names, addresses, phone numbers, and email addresses. Protecting this data is critical to prevent identity theft and other types of fraud, and PII detection is a key step in that process. In this post, we will discuss how to identify PII using Presidio, an open-source ML library.
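As a rough sketch of what PII detection looks like with the presidio-analyzer package (the sample text is made up; the library also needs a spaCy language model installed):

```python
from presidio_analyzer import AnalyzerEngine

# Detect PII entities in a piece of free text (the text below is a made-up example)
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text="My name is John Doe and my phone number is 212-555-1234",
    entities=["PERSON", "PHONE_NUMBER"],
    language="en",
)

# Each result reports the entity type, its position in the text, and a confidence score
for result in results:
    print(result.entity_type, result.start, result.end, result.score)
```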

Using Pandas on Spark

Pandas is one of the most popular Python libraries used by data scientists and data engineers for data wrangling and data analysis. Pandas provides DataFrames (a table-like structure that stores data in rows and columns) for working with structured datasets. These DataFrames are very similar to Spark's DataFrames. However, Pandas DataFrames are limited to a single machine's memory.
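For instance, with the pandas API on Spark (available as pyspark.pandas in Spark 3.2 and later; the file path and column names below are hypothetical), the code looks almost identical to plain Pandas:

```python
import pyspark.pandas as ps

# Read a CSV into a pandas-on-Spark DataFrame (the path is a placeholder);
# the data is distributed across the cluster instead of held on a single machine.
psdf = ps.read_csv("/data/sales.csv")

# Familiar Pandas-style operations, executed by Spark under the hood
print(psdf.head())
print(psdf.groupby("region")["amount"].sum())
```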

Use HDFS API to read Azure Blob files in Databricks

Databricks provides a wrapper file system API named DBFS (Databricks File System) to perform file-level operations such as read, write, move, delete, and rename. However, sometimes we may need to read the underlying file system objects directly, without going through the DBFS wrapper APIs. To do so, we can use the HDFS APIs available through py4j.
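A minimal sketch of that idea in a Databricks notebook (the storage account, container, and path below are placeholders, and it assumes the cluster already has access to the storage configured; `spark` is the notebook's built-in SparkSession):

```python
# Access the Hadoop FileSystem API directly through Spark's py4j JVM gateway
# (the abfss URI below is a placeholder for your own storage account/container).
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
FileSystem = spark._jvm.org.apache.hadoop.fs.FileSystem

path = Path("abfss://mycontainer@myaccount.dfs.core.windows.net/raw/")
fs = FileSystem.get(path.toUri(), hadoop_conf)

# List the files under the path without going through the DBFS wrapper
for status in fs.listStatus(path):
    print(status.getPath().toString(), status.getLen())
```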

Get HDFS file location of Hive table records as column

In this post, we will learn how to extract the physical HDFS file location of each Hive table record as a column, alongside the table's other columns. We will demonstrate this using HiveQL, PySpark, and Scala. Hive tables can be created as internal (managed) or external tables.
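In PySpark, for example, the built-in input_file_name() function exposes that path as a column (the table name below is a placeholder; the HiveQL counterpart is the virtual column INPUT__FILE__NAME):

```python
from pyspark.sql.functions import input_file_name

# Add the physical file path of each record as a new column
# (the table name "my_db.my_table" is a placeholder).
df = spark.table("my_db.my_table").withColumn("file_path", input_file_name())
df.select("file_path").distinct().show(truncate=False)
```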

Hyperparameter tuning using GridSearchCV and RandomizedSearchCV in Python

In the previous post, we had a brief discussion about GridSearchCV and RandomizedSearchCV. In this post, we will demonstrate how to use the GridSearchCV and RandomizedSearchCV classes available in the scikit-learn library for hyperparameter tuning in Python. We will use scikit-learn's built-in diabetes dataset in this demo.
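A minimal sketch of the idea with the diabetes dataset (the estimator and parameter grid below are just illustrative choices, not the post's exact setup):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

# Illustrative parameter grid; real grids depend on the model and the data
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

# Exhaustive search over every combination, with 5-fold cross-validation
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
grid.fit(X, y)
print("GridSearchCV best params:", grid.best_params_)

# Randomized search samples a fixed number of candidates from the same space
rand = RandomizedSearchCV(
    RandomForestRegressor(random_state=42), param_grid, n_iter=5, cv=5, random_state=42
)
rand.fit(X, y)
print("RandomizedSearchCV best params:", rand.best_params_)
```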

Show full column content in Spark

This post explains how to display the full contents of DataFrame columns in Apache Spark. By default, Spark truncates column values longer than 20 characters when showing a DataFrame. However, sometimes we need to see the full values rather than the truncated output, since truncated data is often not very useful for inspection.
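For reference, the fix is the truncate argument of show(); a small self-contained example (the sample string is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("This is a fairly long string that would normally be cut off",)], ["comment"]
)

# Default behavior: values longer than 20 characters are truncated
df.show()

# Display the full column contents
df.show(truncate=False)
```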

Spark read file with special characters using PySpark

Suppose we have a CSV file that contains some non-English characters (Spanish, Japanese, etc.) and we want to read it into a Spark DataFrame. If we read the file without specifying the right character encoding, we will end up with junk characters (like �) in the DataFrame. So, the file's encoding must be supplied while reading it.
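A quick sketch of that fix (the file path and the ISO-8859-1 encoding below are placeholders; the actual encoding depends on how the file was produced):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tell Spark which character encoding to use while reading the CSV
# (the path and the encoding are just examples).
df = (
    spark.read
    .option("header", True)
    .option("encoding", "ISO-8859-1")
    .csv("/data/customers_latin1.csv")
)
df.show(truncate=False)
```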

Read CSV file with Newline character in PySpark

Apache Spark is a Big Data cluster computing framework that can run in standalone mode, on Hadoop, Kubernetes, or Mesos clusters, or in the cloud. We can read and write data from various data sources using Spark. For example, we can use CSV (comma-separated values) and TSV (tab-separated values) files as input sources for a Spark application.
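As an illustration of the usual approach to this problem (the file path is a placeholder), the multiLine option lets Spark keep newline characters that appear inside quoted fields:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine=True keeps newline characters that appear inside quoted CSV fields
# instead of treating them as record separators (the path is a placeholder).
df = (
    spark.read
    .option("header", True)
    .option("multiLine", True)
    .option("escape", '"')
    .csv("/data/comments_with_newlines.csv")
)
df.show(truncate=False)
```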
