Gopal Krishna Ranjan

Gopal is a passionate Data Engineer and Data Analyst. He has implemented many end-to-end solutions using Big Data, Machine Learning, OLAP, OLTP, and cloud technologies. He loves to share his experience at https://sqlrelease.com/. Connect with Gopal on LinkedIn at https://www.linkedin.com/in/ergkranjan/.

Handling exceptions: Rollback pandas dataframe’s to_sql operation

Pandas is one of the most popular Python libraries for data manipulation and analysis. It provides very convenient and useful methods for working with tabular data. One of the pandas dataframe's essential methods is to_sql, which allows seamless integration with various databases. However, it's crucial to understand how to handle exceptions during this operation so that a failed write can be rolled back.
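
A minimal sketch of the idea, assuming SQLAlchemy with a placeholder Postgres connection string and table name (swap in your own):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; replace with your own database URL.
engine = create_engine("postgresql://user:password@localhost:5432/demo")

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# engine.begin() opens a transaction that commits on success and
# rolls back automatically if to_sql raises an exception.
try:
    with engine.begin() as conn:
        df.to_sql("target_table", con=conn, if_exists="append", index=False)
except Exception as exc:
    # By this point the transaction has already been rolled back.
    print(f"Write failed and was rolled back: {exc}")
```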

Read and write data from Cosmos DB to Spark

In the vast and ever-expanding landscape of big data technologies, Apache Spark is an open-source, lightning-fast, and versatile framework for large-scale data analytics: a powerful distributed data processing engine that helps us analyze and derive insights from massive datasets. Cosmos DB, on the other hand, is a globally distributed, multi-model database service from Microsoft Azure.
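
A rough sketch of the round trip, assuming the Azure Cosmos DB Spark 3 OLTP connector is on the classpath; the endpoint, key, database, and container values below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cosmos-demo").getOrCreate()

# Placeholder account details for the Cosmos DB OLTP connector.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<account-key>",
    "spark.cosmos.database": "demo_db",
    "spark.cosmos.container": "demo_container",
}

# Read the container into a Spark dataframe.
df = spark.read.format("cosmos.oltp").options(**cosmos_config).load()

# Write the dataframe back, overwriting items that share an id.
(df.write.format("cosmos.oltp")
   .options(**cosmos_config)
   .option("spark.cosmos.write.strategy", "ItemOverwrite")
   .mode("append")
   .save())
```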

Create pandas dataframe from MongoDB collection

In this post, we will learn how we can create a pandas dataframe from a MongoDB collection. MongoDB is a popular NoSQL database that stores data in a JSON-like format and offers a flexible and scalable solution for managing large volumes of data. When working with data stored in MongoDB, it is often necessary to analyze and manipulate it with familiar tools such as pandas.
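
A minimal sketch using pymongo, with a hypothetical local deployment, database, and collection name:

```python
import pandas as pd
from pymongo import MongoClient

# Placeholder connection details; adjust for your deployment.
client = MongoClient("mongodb://localhost:27017/")
collection = client["demo_db"]["users"]

# find() returns a cursor of dicts; pandas can build a dataframe
# directly from that list. Excluding _id avoids an ObjectId column.
docs = list(collection.find({}, {"_id": 0}))
df = pd.DataFrame(docs)
print(df.head())
```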

Creating a Wheel File in Python: Simplifying Package Distribution

In the Python ecosystem, package distribution plays a crucial role in sharing and reusing code efficiently. While Python's built-in package manager, pip, allows us to install packages effortlessly, sometimes it becomes necessary to distribute our own Python packages. In such cases, wheel files prove to be a valuable asset. A wheel file is a built distribution format that can be installed directly, without a build step.
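
For illustration, a minimal, hypothetical setup.py that setuptools can turn into a wheel (the package name and dependency are made up):

```python
# setup.py -- a minimal, hypothetical package definition.
from setuptools import setup, find_packages

setup(
    name="demo_package",            # assumed package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=["requests"],  # example dependency
)
```

Running `python -m build --wheel` (or `pip wheel .`) in the project root then produces the .whl file under dist/.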

Optimize Spark dataframe write performance for JDBC

Apache Spark is a popular big data processing engine designed to handle large-scale data processing tasks. When it comes to writing data over JDBC, Spark provides a built-in JDBC connector that allows users to write data to various relational databases easily. We can write a Spark dataframe to SQL Server, MySQL, Oracle, Postgres, etc.
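
A sketch of a tuned JDBC write with placeholder connection details; batchsize and isolationLevel are standard Spark JDBC write options, and the partition count controls how many parallel connections Spark opens:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-demo").getOrCreate()
df = spark.range(1_000_000)  # sample data

# Placeholder JDBC URL and credentials.
(df.repartition(8)                        # one JDBC connection per partition
   .write.format("jdbc")
   .option("url", "jdbc:postgresql://localhost:5432/demo")
   .option("dbtable", "public.target_table")
   .option("user", "demo_user")
   .option("password", "demo_password")
   .option("batchsize", "10000")          # rows per INSERT batch (default 1000)
   .option("isolationLevel", "NONE")      # skip transaction overhead if acceptable
   .mode("append")
   .save())
```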

Create requirements.txt file in Python automatically

In this post, we will learn how to create a requirements.txt file for a Python project. The requirements.txt file lists all the packages needed to execute the project. It is very helpful, especially during deployment: using the requirements.txt file, we can automate deployment of the project to a different environment.
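
A minimal sketch that mirrors `pip freeze > requirements.txt` from within Python:

```python
# Capture the current environment's installed packages into
# requirements.txt, equivalent to `pip freeze > requirements.txt`.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "pip", "freeze"],
    capture_output=True, text=True, check=True,
)

with open("requirements.txt", "w") as f:
    f.write(result.stdout)

print("Wrote", len(result.stdout.splitlines()), "pinned packages")
```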

PII Data Identification using Presidio Open Source ML Library

In today's digital age, organizations deal with large amounts of sensitive data, including PII such as names, addresses, phone numbers, and email addresses. Protecting this data is critical to preventing identity theft and other types of fraud, and PII detection is a key step in the process. In this post, we will discuss how to identify PII using the open-source Presidio library.
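
A minimal sketch using Presidio's analyzer and anonymizer engines on a made-up sentence:

```python
# Requires: pip install presidio-analyzer presidio-anonymizer
# plus a spaCy model, e.g. python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "John Smith's phone number is 212-555-5555 and email is john@example.com."

# Detect PII entities in free text.
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")
for r in results:
    print(r.entity_type, text[r.start:r.end], round(r.score, 2))

# Replace the detected spans with entity-type placeholders.
anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=results).text)
```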

Difference between Hadoop 1.x, Hadoop 2.x and Hadoop 3.x

We know that Apache Hadoop is a framework that allows us to perform distributed data processing on very large datasets using commodity computers. The framework is highly scalable and can scale up from a single machine to thousands of machines. Most importantly, Hadoop is open source and provides built-in fault tolerance.

Fill null with the next not null value – Spark Dataframe

In a previous post, we discussed how to fill a null value with the previous not-null value in a Spark dataframe, and we have also discussed how to extract the non-null values per group from a Spark dataframe. Now, in this post, we will learn how to fill a null value with the next available not-null value.
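
A sketch of one standard approach: first() with ignorenulls over a window that looks forward from the current row. Column names and data are made up, and a real per-group fill would add partitionBy to the window:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("fill-next-demo").getOrCreate()
df = spark.createDataFrame(
    [(1, None), (2, "a"), (3, None), (4, "b")], "id int, value string"
)

# From the current row forward, take the first non-null value.
w = Window.orderBy("id").rowsBetween(Window.currentRow, Window.unboundedFollowing)
df.withColumn("value_filled", F.first("value", ignorenulls=True).over(w)).show()
```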

Fill null with the previous not null value – Spark Dataframe

In the previous post, we discussed how to extract the non-null values per group from a Spark dataframe. Now, in this post, we will learn how to fill the null values with the previous not-null value in a Spark dataframe, i.e., a forward fill that carries the last observed value forward. To demonstrate this with the help of an example, we will use a small sample dataframe.
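
The mirror-image sketch of the previous example: last() with ignorenulls over a window that looks backward, so each null inherits the most recent non-null value. The data here is again made up:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("fill-previous-demo").getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (2, None), (3, None), (4, "b")], "id int, value string"
)

# From the start of the window up to the current row, take the last
# non-null value, i.e. carry the previous value forward.
w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("value_filled", F.last("value", ignorenulls=True).over(w)).show()
```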
