MapReduce

Difference between Hadoop 1.x, Hadoop 2.x and Hadoop 3.x

We know that Apache Hadoop is a framework that lets us process very large datasets in a distributed way on commodity computers. This makes the framework highly scalable: it can scale from a single machine to thousands of machines. Most importantly, Hadoop is open source and provides […]

Big Data processing using Apache Spark – Introduction

What is Spark? Apache Spark is an open source, general-purpose distributed cluster computing framework and a unified engine for big data processing. It is designed for fast cluster computing: using in-memory computation, an application can run up to 100 times faster than on Hadoop MapReduce. Also, […]

Understanding Map join in Hive

Apache Hive is a data warehouse tool whose query language is used to read, transform and write large datasets in a distributed environment. Its SQL-like syntax is translated into MapReduce jobs that execute on Hadoop clusters. In the Hadoop ecosystem, we use Hive for batch processing to extract, transform and […]
