Analytics/ML Archives - Page 2 of 6

Creating a Wheel File in Python: Simplifying Package Distribution

In the Python ecosystem, package distribution plays a crucial role in sharing and reusing code efficiently. While Python’s built-in package manager, pip, allows us to install packages effortlessly, sometimes it becomes necessary to distribute our own Python packages. In such cases, wheel files prove to be a valuable asset. A wheel file is a built […]

Create requirements.txt file in Python automatically

Python / Gopal Krishna Ranjan / Feb 13, 2023 / big data processing, python, step by step

In this post, we will learn how to create a requirements.txt file for a Python project. The requirements.txt file lists all the packages needed to run the project. It is especially helpful during deployment. Using the requirements.txt file, we can automate the deployment of the project to a different
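
As a rough illustration of the idea, the sketch below lists every distribution installed in the current environment in `name==version` form, similar in spirit to the output of `pip freeze`. It uses only the standard-library `importlib.metadata` module. The helper names `installed_requirements` and `write_requirements` are assumptions made for this example and are not taken from the post.

```python
from importlib import metadata


def installed_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def write_requirements(path="requirements.txt"):
    """Write the pinned list to a file, one requirement per line."""
    lines = installed_requirements()
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + ("\n" if lines else ""))
    return lines
```

Note that this pins everything installed in the environment, not only the packages the project actually imports, so it is best run inside a dedicated virtual environment.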

PII Data Identification using Presidio Open Source ML Library

Machine Learning / Gopal Krishna Ranjan / Feb 11, 2023 / data analysis, data preprocessing, machine learning - step by step, python

In today’s digital age, organizations deal with large amounts of sensitive data, including personally identifiable information (PII) such as names, addresses, phone numbers, and email addresses. Protecting this data is critical to prevent identity theft and other types of fraud, and PII detection is a key step in the process. In this post, we will discuss

Fill null with the next not null value – Spark Dataframe

Scala, Spark / Gopal Krishna Ranjan / Dec 24, 2022 / big data processing, data analysis, Hadoop, scala

In the previous post, we discussed how to fill a null value with the previous not-null value in a Spark DataFrame. We have also discussed how to extract the non-null values per group from a Spark DataFrame. Now, in this post, we will learn how to fill a null value with the next available not-null value
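
The post itself works with Spark DataFrames. Purely to illustrate the idea, here is a minimal plain-Python sketch on a list, where `None` stands in for null. The function name `fill_with_next` is hypothetical, and this is not the Spark code from the post.

```python
def fill_with_next(values):
    """Replace each None with the next non-None value that follows it."""
    filled = list(values)
    next_value = None
    # Walk from the end so the "next" value is already known at each step.
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_value
        else:
            next_value = filled[i]
    return filled


print(fill_with_next([None, 3, None, None, 7, None]))
# prints [3, 3, 7, 7, 7, None]
```

A trailing null stays null because no later value exists to fill it.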

Fill null with the previous not null value – Spark Dataframe

Scala, Spark / Gopal Krishna Ranjan / Dec 19, 2022 / big data processing, data analysis, Hadoop, scala

In the previous post, we discussed how to extract the non-null values per group from a Spark DataFrame. Now, in this post, we will learn how to fill the null values with the previous not-null value in a Spark DataFrame using the forward-fill method. To demonstrate this with the help of an example, we will
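
To make the idea concrete outside of Spark, the following plain-Python sketch carries the last seen non-null value forward over a list, with `None` standing in for null. The function name `fill_with_previous` is hypothetical and the code is only an illustration, not the approach used in the post.

```python
def fill_with_previous(values):
    """Replace each None with the most recent non-None value before it."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


print(fill_with_previous([None, 3, None, None, 7, None]))
# prints [None, 3, 3, 3, 7, 7]
```

A leading null stays null because no earlier value exists to carry forward.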

Get the first non-null value per group Spark dataframe

Scala, Spark / Gopal Krishna Ranjan / Nov 12, 2022 / big data processing, Hadoop, scala

Suppose we need to get the first non-null value from each partition of a DataFrame. Specifically, we want only the first non-null value of each column, regardless of which row it comes from. That means a non-null value of column A from row 5 can be stitched together with another non-null value of column B from
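
The "stitching" described above can be sketched in plain Python on a list of dictionaries, with `None` standing in for null. The function name `first_non_null_per_group`, the column names, and the sample rows are all made up for this illustration. The sketch assumes the input rows are already in the desired order, whereas in Spark the ordering would have to be specified explicitly.

```python
def first_non_null_per_group(rows, key):
    """Return {group: {column: first non-None value seen in that group}}."""
    result = {}
    for row in rows:
        group = result.setdefault(row[key], {})
        for column, value in row.items():
            if column == key:
                continue
            # Keep the first non-None value; until one appears, store None.
            if group.get(column) is None:
                group[column] = value
    return result


rows = [
    {"id": 1, "a": None, "b": "x"},
    {"id": 1, "a": 10, "b": None},
    {"id": 1, "a": 20, "b": "y"},
    {"id": 2, "a": None, "b": None},
    {"id": 2, "a": 5, "b": None},
]
print(first_non_null_per_group(rows, "id"))
# prints {1: {'a': 10, 'b': 'x'}, 2: {'a': 5, 'b': None}}
```

For group 1, the value of `a` comes from the second row and the value of `b` from the first row, which is exactly the cross-row stitching the excerpt describes.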

Scala Option, Some, None – Exception and Null handling

Scala / Gopal Krishna Ranjan / Oct 5, 2022 / big data processing, Hadoop, scala

In the previous post, we discussed the Try, Success, Failure exception handling method. Now, in this post, we will discuss Scala’s Option, Some, None pattern and its usage. Scala is a high-level programming language combining object-oriented and functional programming in one place. It is a very powerful programming language that can be

Scala Try, Success, Failure – Functional error handling

Scala / Gopal Krishna Ranjan / Sep 26, 2022 / big data processing, Hadoop, scala

In this post, we will discuss Scala’s functional error handling method using Try, Success, Failure. We know that Scala is a high-level programming language that combines both object-oriented and functional programming in one place. It runs on the JVM, so it can be mixed seamlessly with Java. Scala’s static types help to identify bugs at

Execute Scala file in Spark without creating a jar

Scala, Spark / Gopal Krishna Ranjan / Aug 17, 2022 / big data processing, Hadoop, scala

In this post, we will learn how to execute a Scala file in Spark without creating a jar file. We know that a Scala source code file has the extension .scala. Normally, we need to package the source code into a jar file to execute an application written in Scala. We can create

Using Pandas on Spark

Python, Spark / Gopal Krishna Ranjan / Jul 31, 2022 / big data processing, pyspark, python

Pandas is one of the most popular Python libraries used by data scientists and data engineers for data wrangling and data analysis. Pandas provides DataFrames (a table-like structure that stores data in rows and columns) to deal with structured datasets. These DataFrames are very similar to Spark’s DataFrames. However, Pandas DataFrames are limited to a single
