Tag: big data processing


Read and write data to SQL Server from Spark using pyspark

Apache Spark is a powerful general-purpose distributed computing framework. It provides several data abstractions, such as RDDs, DataFrames, and DataSets, on top of a distributed collection of data. Spark is a highly scalable big data processing engine that can run on anything from a single node to a cluster of thousands of nodes. To follow this exercise, we can install Spark on our local machine and use Jupyter notebooks to write code interactively. In this post, “Read and write data to SQL Server from Spark using pyspark”, we demonstrate how to use Apache Spark to read data from and write data to a SQL Server table.

Read SQL Server table to DataFrame using Spark SQL JDBC connector – pyspark

Spark SQL APIs can read data from any relational data source that provides a JDBC driver. We can read the data of a SQL Server table …
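As a rough illustration of the JDBC approach described above, here is a minimal sketch in pyspark. The host, port, database, table, and credential values are hypothetical placeholders, and the helper function names are made up for this example. It also assumes the Microsoft SQL Server JDBC driver jar is available on Spark's classpath.

```python
# Minimal sketch: read from and write to SQL Server over JDBC with pyspark.
# All connection values used below are hypothetical placeholders.

SQLSERVER_DRIVER = "com.microsoft.sqlserver.jdbc.SQLServerDriver"


def sqlserver_jdbc_url(host, port, database):
    """Build a SQL Server JDBC connection URL."""
    return f"jdbc:sqlserver://{host}:{port};databaseName={database}"


def jdbc_options(url, table, user, password):
    """Collect the options Spark's JDBC data source expects."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": SQLSERVER_DRIVER,
    }


def read_table(spark, options):
    """Load a SQL Server table into a Spark DataFrame."""
    return spark.read.format("jdbc").options(**options).load()


def write_table(df, options, mode="append"):
    """Write a DataFrame to a SQL Server table ("append", "overwrite", ...)."""
    df.write.format("jdbc").options(**options).mode(mode).save()


# Example usage (requires pyspark, a reachable SQL Server instance,
# and the JDBC driver jar; names are assumptions for illustration):
#
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("sqlserver-demo").getOrCreate()
# url = sqlserver_jdbc_url("localhost", 1433, "SalesDb")
# opts = jdbc_options(url, "dbo.Orders", "spark_user", "secret")
# orders = read_table(spark, opts)
# orders.show(5)
# write_table(orders, {**opts, "dbtable": "dbo.OrdersCopy"})
```

Keeping the connection options in one dictionary makes it easy to reuse the same settings for both the read and the write, changing only the `dbtable` value.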


RDD, DataFrame, and DataSet – Introduction to Spark Data Abstraction

Apache Spark is a general-purpose distributed computing engine used for Big Data processing, both batch and stream. It provides high-level APIs such as Spark SQL, Spark Streaming, MLlib, and GraphX that allow interaction with the core functionality of Apache Spark. Spark also provides several core data abstractions on top of the distributed collection of data: RDDs, DataFrames, and DataSets. In this post, we are going to discuss these core data abstractions available in Apache Spark.

Spark Data Abstraction

The data abstraction in Spark represents a logical data structure over the underlying data distributed across different nodes of the cluster. The data abstraction APIs provide a wide range of transformation methods (like map(), filter(), etc.) which are used to perform computations in a distributed way. However, transformations are evaluated lazily: in order to execute them, we need to call an action method like show(), collect(), etc.
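The difference between transformations and actions can be sketched with a small RDD example. This is only an illustration: the function names and the local-mode session settings are assumptions, and `main()` needs a working pyspark installation to run.

```python
# Minimal sketch: transformations are lazy, actions trigger execution.

def squares_of_evens(numbers):
    """Chain two transformations on an RDD; no computation happens here."""
    evens = numbers.filter(lambda n: n % 2 == 0)  # transformation (lazy)
    return evens.map(lambda n: n * n)             # transformation (lazy)


def run_example(spark):
    """Build the pipeline, then call an action to actually execute it."""
    rdd = spark.sparkContext.parallelize(range(1, 11))
    pipeline = squares_of_evens(rdd)  # still nothing has been computed
    return pipeline.collect()         # action: runs the job, returns a list


def main():
    # Imported here so the helpers above can be defined without pyspark.
    from pyspark.sql import SparkSession

    # Local-mode session; the app name is an arbitrary placeholder.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("lazy-evaluation-demo")
        .getOrCreate()
    )
    try:
        print(run_example(spark))
    finally:
        spark.stop()
```

Calling `main()` on a machine with pyspark installed runs the pipeline. Until `collect()` is invoked, Spark only records the chain of transformations and does not execute it.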

Let’s have a …