Thursday, October 8, 2015

Resilient Distributed Dataset and Related Operations (Transformations and Actions) in Apache Spark (Part 1)

This post is about the RDD (Resilient Distributed Dataset), the backbone of Apache Spark. Apache Spark is a memory-based distributed computing framework and an alternative to Hadoop for some scenarios. It is built around an immutable abstraction called the RDD. Unlike similar distributed computing frameworks such as Hadoop, Apache Spark is just an execution engine. It supports various input sources like HDFS, MongoDB, Cassandra, etc. The core of Apache Spark is written in Scala, but it also supports Java and Python for writing driver code.

The developers kept it as simple and as lightweight as possible for the end user. The user just has to focus on the main logic rather than on the complete flow of the algorithm. In a runtime environment like MapReduce, the user has to code to the MapReduce design pattern while writing the application. Even in the case of MPI (Message Passing Interface), the programmer has to take care of communication between multiple processes and of sending data to those processes. In Apache Spark, the user focuses on the driver code and the required operations on the data; the runtime engine takes care of parallelizing the computation and scheduling it.
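
As a concrete example, here is a minimal word-count driver written in Python. This is only a sketch: the HDFS path is a placeholder, and the exact setup depends on your Spark deployment. The point is that the user only expresses the operations on the data; Spark parallelizes and schedules them.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

lines = sc.textFile("hdfs:///data/input.txt")          # placeholder input path
counts = (lines.flatMap(lambda line: line.split())     # RDD of words
               .map(lambda word: (word, 1))            # RDD of (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))       # RDD of (word, count) pairs

for word, count in counts.take(10):                    # action: triggers execution
    print(word, count)

sc.stop()
```
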
Nowadays Apache Spark has a number of use cases where we want to handle huge data and the computation pattern is mainly iterative and demands user interaction. The most common use cases are machine learning algorithms like K-Means. Here Apache Spark stays ahead of Hadoop and MPI.
As I said, Apache Spark is a distributed computing framework, so it must support the traditional requirements of distributed computing frameworks like
1) Distributed Computation
2) Fault tolerance

It also supports requirements which are very common in today's big data era.

1) Persistence level.
2) Lazy evaluation.

Now, let's try to discuss them in detail.

1) Distributed Computation:

Like other distributed and parallel computing frameworks, Apache Spark supports parallelization, using RDDs and partitions. Partitions are logical divisions of an RDD, and the number of partitions is the number of independent tasks that we want to execute in parallel. An RDD is immutable, i.e. you can create a new RDD but you cannot modify an existing one. An operation on an RDD creates a new RDD. It is like a string in Java or a tuple in Python.
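
A small sketch in Python (assuming an already created SparkContext named sc) makes this concrete: parallelize() splits the data into partitions, and map() does not modify the original RDD but returns a new one.

```python
numbers = sc.parallelize(range(1, 101), 4)   # RDD split into 4 partitions
print(numbers.getNumPartitions())            # -> 4

squares = numbers.map(lambda x: x * x)       # a new RDD; numbers is untouched
print(squares.take(5))                       # [1, 4, 9, 16, 25]
```
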

2) Fault tolerance:

Once the user creates the first RDD, he/she can execute a number of operations on it, some of which cause the creation of new RDDs and their respective partitions. In other words, an RDD is just metadata rather than actual data: it contains the information about how to compute it from another RDD or from a data source. Thus the complete driver program may have more than one RDD, each carrying the complete details and operations required to compute it. If a single node fails, the respective partitions are lost, but as the RDD has the complete information required to compute it, we can recompute those partitions as and when needed. Thus it provides fault tolerance.
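
This lineage can be inspected from the driver. The sketch below (again assuming an existing SparkContext sc) builds a small chain of RDDs; toDebugString() prints the parent RDDs that Spark would replay to recompute a lost partition.

```python
base     = sc.parallelize(range(1000), 4)
filtered = base.filter(lambda x: x % 2 == 0)   # records "filter of base"
doubled  = filtered.map(lambda x: x * 2)       # records "map of filtered"

# Prints the chain of parent RDDs (the lineage) used for recomputation.
print(doubled.toDebugString())
```
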

3) Lazy evaluation:
Most mining algorithms that work on big data need to shuffle and load data that does not fit in the main memory of a single machine. Even if we club the memory of all machines together to form a single logical main memory, it is not possible to accommodate the intermediate results completely using just the main memory of all nodes. Thus Apache Spark postpones the computation of such intermediate results until they are required by some action. This is called lazy evaluation.
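
A short sketch of this behaviour (assuming an existing SparkContext sc and a placeholder HDFS path): the transformations below only record what has to be done, and nothing is read or computed until the final action runs.

```python
lines  = sc.textFile("hdfs:///data/huge.log")       # nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)       # still nothing computed
pairs  = errors.map(lambda l: (l.split()[0], 1))    # still nothing computed

# Only this action forces the whole chain above to execute on the cluster.
print(pairs.count())
```
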

4) Persistence level:
Many algorithms, like K-Means (with its centroid list), require the same RDD again and again. So instead of keeping it only on hard disk, Apache Spark allows you to keep it in memory by using the persistence level concept. You can maintain it on hard disk, in memory, or both, as required.
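
A sketch of how this looks in Python (assuming an existing SparkContext sc and a placeholder input path): persist() with a StorageLevel keeps the RDD around between iterations instead of recomputing it each time.

```python
from pyspark import StorageLevel

points = sc.textFile("hdfs:///data/points.csv") \
           .map(lambda line: [float(x) for x in line.split(",")])

points.persist(StorageLevel.MEMORY_AND_DISK)   # or points.cache() for memory only

for i in range(10):                            # each pass reuses the persisted RDD
    print(points.count())

points.unpersist()
```
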

This is a logical view of the Apache Spark RDD concept. In the next part we will discuss what transformations and actions are possible on an RDD.