data analysis

PySpark Dataframes: Adding a Column with a List of Values

PySpark is the Python API for Apache Spark, an engine built for processing very large datasets. A common task when organizing data is adding a new piece of information to a table, which in the world of […]

Dynamically Create Spark DataFrame Schema from Pandas DataFrame

Apache Spark has become a powerful tool for processing large-scale data in a distributed environment. One of its key components is the Spark DataFrame, which offers a higher-level abstraction over distributed data and enables efficient manipulation of large datasets. When working within

Python Regex – re match vs re search vs re findall

Regular expressions, known as regex, are a powerful tool for pattern matching and string manipulation. Python provides a built-in module called re for working with them. This module offers several functions for performing regex operations, including matching, searching, and finding all occurrences of a pattern. In this blog post, we
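As a quick illustration of the three functions named in the title, here is a minimal sketch. The sample string and pattern are made up for this example and are not taken from the post itself.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration.
text = "order 66 shipped, order 77 pending"
pattern = r"\d+"

# re.match only tries the pattern at the very start of the string.
at_start = re.match(pattern, text)

# re.search scans the string and returns the first match found anywhere.
first = re.search(pattern, text)

# re.findall returns every non-overlapping match as a list of strings.
all_matches = re.findall(pattern, text)

print(at_start)        # None, because the text starts with "order", not a digit
print(first.group())   # 66
print(all_matches)     # ['66', '77']
```

The difference that tends to surprise people is that re.match returns None here even though the string contains digits, because it does not look past position 0.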

Displaying Long Strings in Pandas: How to Print Complete Text in DataFrame Without Truncation

When working with pandas DataFrames, long text values are often truncated on display, especially when the data is large. This truncation makes it difficult to analyze the complete content, which is frustrating when the text contains details that are crucial for the analysis.
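One common way to see the full text is to lift the display.max_colwidth option. The sketch below assumes a recent pandas version (1.0 or later, where None is accepted) and uses a hypothetical one-column frame; it may not be the approach the post itself takes.

```python
import pandas as pd

# Hypothetical one-column frame holding a 120-character string.
long_text = "x" * 120
df = pd.DataFrame({"review": [long_text]})

# With default settings, cells longer than display.max_colwidth
# (50 characters by default) are shortened with "...".
truncated_view = repr(df)

# Setting display.max_colwidth to None removes the limit,
# but only inside this with-block.
with pd.option_context("display.max_colwidth", None):
    full_view = repr(df)

print(truncated_view)
print(full_view)
```

Using pd.option_context keeps the change local; pd.set_option("display.max_colwidth", None) would apply it for the rest of the session instead.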

The Easiest Way to Display All Columns of a Pandas DataFrame

In the domain of data analysis and manipulation, pandas is a powerhouse library in Python. However, when working with larger datasets or complex DataFrames, displaying all columns can be challenging. When we display the content of a pandas DataFrame, pandas tries to fit all of the columns on the screen. As a result,
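A minimal sketch of one widely used option for this, display.max_columns, is shown below. The 30-column frame and its column names are hypothetical, and the post may recommend a different setting.

```python
import pandas as pd

# Hypothetical wide frame: one row, 30 columns named col_0 .. col_29.
column_names = [f"col_{i}" for i in range(30)]
df = pd.DataFrame([list(range(30))], columns=column_names)

# With default settings pandas may replace the middle columns with "...".
default_view = repr(df)

# Setting display.max_columns to None tells pandas not to drop any column.
# Wide output may still wrap across several lines.
with pd.option_context("display.max_columns", None):
    full_view = repr(df)

print(full_view)
```

The same option can be set for a whole session with pd.set_option("display.max_columns", None).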

Simplify Data Analysis: One-Hot Encoding for Multi-Valued Categorical Variables in Pandas DataFrame

Categorical variables are very common in machine learning datasets. These variables represent non-numeric values such as days of the week, gender, or colors. Typically, however, we need to convert them to a numerical format before using them in machine learning algorithms. One-hot encoding is a powerful technique that accomplishes this transformation
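For the multi-valued case mentioned in the title, where one cell holds several categories, a minimal sketch using Series.str.get_dummies could look like this. The genres column and the "|" separator are assumptions made for the example, not details from the post.

```python
import pandas as pd

# Hypothetical column where each cell holds one or more categories,
# separated by "|".
df = pd.DataFrame({"genres": ["action|comedy", "drama", "comedy|drama"]})

# str.get_dummies splits each cell on the separator and creates one
# 0/1 indicator column per distinct category.
encoded = df["genres"].str.get_dummies(sep="|")

print(encoded)
```

Each row ends up with a 1 in every category column it belongs to, so "action|comedy" sets both the action and comedy columns.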

Handling exceptions: Rollback pandas dataframe’s to_sql operation

Pandas is one of the most popular Python libraries for data manipulation and analysis. It provides convenient methods for analyzing tabular data. One essential DataFrame method is to_sql, which writes a DataFrame to a table in a SQL database. However, it’s crucial to understand how to handle

Create pandas dataframe from MongoDB collection

In this post, we will learn how to create a pandas DataFrame from a MongoDB collection. MongoDB is a popular NoSQL database that stores data in a JSON-like format and offers a flexible and scalable solution for managing large volumes of data. When working with data stored in MongoDB, it is often necessary to analyze and

PII Data Identification using Presidio Open Source ML Library

In today’s digital age, organizations deal with large amounts of sensitive data, including personally identifiable information (PII) such as names, addresses, phone numbers, and email addresses. Protecting this data is critical to prevent identity theft and other types of fraud, and PII detection is a key step in the process. In this post, we will discuss

Fill null with the next not null value – Spark Dataframe

In earlier posts, we discussed how to fill a null value with the previous not-null value in a Spark DataFrame and how to extract the non-null values per group from a Spark DataFrame. Now, in this post, we will learn how to fill a null value with the next available not-null value
