Big Data/Cloud

Reading Data from Cosmos DB in Databricks: A Comprehensive Guide

In today’s data-driven world, organizations leverage various data storage solutions to manage and analyze their data effectively. Cosmos DB, a globally distributed NoSQL database service from Microsoft Azure, is widely used for building highly scalable and responsive applications. In this blog post, we will explore how to read data from Cosmos DB in Databricks, a […]
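The full walkthrough sits behind the link; as a rough sketch, reading a Cosmos DB container in Databricks with the Azure Cosmos DB Spark connector looks like the snippet below. This assumes the `azure-cosmos-spark` connector (which registers the `cosmos.oltp` format) is installed on the cluster, `spark` is the ambient Databricks session, and the endpoint/key/database/container values are placeholders you replace with your own.

```python
# Connection settings for the Cosmos DB Spark connector (placeholder values).
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<your-account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<your-account-key>",
    "spark.cosmos.database": "<your-database>",
    "spark.cosmos.container": "<your-container>",
}

# Read the container into a Spark DataFrame using the connector's format.
df = (
    spark.read.format("cosmos.oltp")
    .options(**cosmos_config)
    .load()
)
df.show()
```
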

PySpark Dataframes: Adding a Column with a List of Values

PySpark is a tool that lets you work with large amounts of data in Python. It’s part of Apache Spark, which is known for handling really big datasets. A common task when organizing data is adding a new piece of information to a table, which in the world of

Dynamically Create Spark DataFrame Schema from Pandas DataFrame

Apache Spark has become a powerful tool for processing large-scale data in a distributed environment. One of its key components is the Spark DataFrame, which offers a higher-level abstraction over distributed data and enables efficient manipulation of large datasets. When working within

Git: Step-by-Step Guide to Rebasing the Develop Branch onto Main

Rebasing the develop branch onto the main branch is a popular workflow in Git that allows you to incorporate the latest changes from the main branch into the develop branch while maintaining a linear history. This is especially useful when multiple teams and developers are working together on a project. This post provides
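As a self-contained sketch, the script below builds a throwaway repo with diverging `main` and `develop` branches and then performs the rebase; in a real repo you would only run the last few commands. It assumes Git 2.28+ (for `git init -b`), and the push line is commented out because the demo has no remote.

```shell
# Throwaway demo repo: one commit each on main and develop after they diverge.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.email demo@example.com && git config user.name demo
echo base > base.txt && git add base.txt && git commit -qm "base"
git checkout -qb develop
echo feature > feature.txt && git add feature.txt && git commit -qm "feature"
git checkout -q main
echo hotfix > hotfix.txt && git add hotfix.txt && git commit -qm "hotfix"

# The rebase itself: replay develop's commits on top of the latest main.
git checkout -q develop
git rebase main            # on conflicts: fix files, `git add`, `git rebase --continue`
git log --oneline          # linear history: base -> hotfix -> feature
# git push --force-with-lease origin develop   # rewritten history needs a forced push
```

`--force-with-lease` is the safer forced push: it refuses to overwrite the remote branch if someone else has pushed to it since you last fetched.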

SQL Server Docker Installation: Step-by-Step Guide for Windows

SQL Server is a very popular, powerful, and versatile option in the ever-evolving landscape of database management. It is a robust and widely used relational database management system (RDBMS) developed and managed by Microsoft. SQL Server natively supports SQL (Structured Query Language) for querying and manipulating data stored in the tables. This makes SQL Server
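The guide behind the link covers the full setup; the core of it usually comes down to two Docker commands like the ones below. The container name, tag, and SA password are placeholders, and a running Docker Desktop installation is assumed.

```shell
# Pull the SQL Server 2022 image and start it in a container (placeholder password).
docker pull mcr.microsoft.com/mssql/server:2022-latest
docker run -d --name sqlserver \
    -e "ACCEPT_EULA=Y" \
    -e "MSSQL_SA_PASSWORD=<YourStrong!Passw0rd>" \
    -p 1433:1433 \
    mcr.microsoft.com/mssql/server:2022-latest
```

Once the container is up, you can connect from the host on `localhost,1433` with the `sa` login and the password you chose.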

Read and write data from Cosmos DB to Spark

In the vast and ever-expanding landscape of big data technologies, Apache Spark is an open-source, lightning-fast, and versatile framework that ignites the power of large-scale data analytics. It is a distributed data processing framework that helps us analyze and derive insights from massive datasets. On the other hand, Cosmos DB is a globally
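The write side mirrors the read side covered earlier: the same connector options, pushed through `df.write`. As before, this is a sketch assuming the `azure-cosmos-spark` connector is installed, `df` is an existing Spark DataFrame whose rows carry the container’s `id` field, and all connection values are placeholders.

```python
# Same placeholder connection settings as for reading.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<your-account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<your-account-key>",
    "spark.cosmos.database": "<your-database>",
    "spark.cosmos.container": "<your-container>",
}

# Append the DataFrame's rows to the Cosmos DB container as documents.
(
    df.write.format("cosmos.oltp")
    .options(**cosmos_config)
    .mode("append")
    .save()
)
```
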

Optimize Spark dataframe write performance for JDBC

Apache Spark is a popular big data processing engine that is designed to handle large-scale data processing tasks. When it comes to writing data to JDBC, Spark provides a built-in JDBC connector that allows users to write data to various relational databases easily. We can write Spark dataframe to SQL Server, MySQL, Oracle, Postgres, etc.
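The usual tuning levers are the number of partitions (each partition writes over its own JDBC connection) and the batch size per round trip. A hedged sketch, assuming an existing DataFrame `df`, a MySQL target, and placeholder connection details:

```python
# Tune parallelism and batching for a JDBC write (placeholder connection values).
(
    df.repartition(8)  # 8 partitions -> up to 8 parallel JDBC connections
    .write
    .format("jdbc")
    # rewriteBatchedStatements=true is a MySQL-driver flag that turns row-by-row
    # inserts into true multi-row batches on the wire.
    .option("url", "jdbc:mysql://<host>:3306/<database>?rewriteBatchedStatements=true")
    .option("dbtable", "target_table")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("batchsize", 10000)  # rows per JDBC batch (Spark's default is 1000)
    .mode("append")
    .save()
)
```

Pick the partition count to match what the target database can absorb; too many parallel writers can overwhelm it and slow the job down overall.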

Difference between Hadoop 1.x, Hadoop 2.x and Hadoop 3.x

We know that Apache Hadoop is a framework that allows us to perform data processing in a distributed way on very large datasets using commodity computers. That is why this framework is highly scalable and can scale up from a single machine to thousands of machines. Most importantly, Hadoop is open source and provides

Fill null with the next not null value – Spark Dataframe

In a previous post, we discussed how to fill a null value with the previous not-null value in a Spark Dataframe. We have also discussed how to extract the non-null values per group from a spark dataframe. Now, in this post, we will learn how to fill a null value with the next available not-null value

Fill null with the previous not null value – Spark Dataframe

In the previous post, we discussed how to extract the non-null values per group from a spark dataframe. Now, in this post, we will learn how to fill the null values with the previous not-null value in a spark dataframe using the forward-fill method. To demonstrate this with the help of an example, we will
