Saturday, July 18, 2015

Custom RDD in Apache Spark using Python (using custom classes to create an RDD).

RDD stands for Resilient Distributed Dataset. It is the abstraction Apache Spark provides for distributed operations on data. This post is about how to create a custom RDD in Python (PySpark), i.e. an RDD whose elements are objects of your own class.

The following code shows a simple Python class with a single data member.

 class Data(object):  
     def __init__(self, data):  
         self.data = data  
     def increment(self):  
         self.data += 1  
     def printData(self):  
         print(self.data)  
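Before bringing Spark into the picture, the class can be exercised locally like any ordinary Python object. This stand-alone sketch repeats the class so it runs on its own; the starting value 41 is just an illustration:

```python
# Stand-alone copy of the Data class from above.
class Data(object):
    def __init__(self, data):
        self.data = data

    def increment(self):
        self.data += 1

    def printData(self):
        print(self.data)

d = Data(41)
d.increment()
d.printData()  # prints 42
```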
Currently, Apache Spark does not support having the class definition in the driver script itself because of a pickling issue: a class defined in the main script cannot be unpickled on the worker nodes. That is why we have to keep the class definition in a separate module (say mod.py, as imported below). The following code listing shows how to use the above class to create the custom RDD.
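The pickling issue can be seen with the standard pickle module itself: an instance is serialized by recording the module and class name to import on the receiving side, not the class body. The sketch below uses fractions.Fraction merely as a stand-in for any class that lives in an importable module; a class defined only in the driver script would be recorded under __main__, which the worker processes cannot import.

```python
import pickle

from fractions import Fraction

# Pickle stores *where to find* the class, not the class body itself.
payload = pickle.dumps(Fraction(1, 2))

# The stream names the defining module and the class, so the receiving
# process must be able to import fractions.Fraction to rebuild the object.
print(b'fractions' in payload and b'Fraction' in payload)  # True
```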
 from pyspark import SparkContext  
 from mod import Data as Data  
   
 sc = SparkContext()  
 #creating the rdd from a list of integers  
 rdd = sc.parallelize([1, 2, 3, 4, 5])  
   
 #function to transform the simple list rdd into a custom rdd  
 def func(item):  
     d = Data(item)  
     return d  
   
 #function to call increment on each element of rdd1  
 def func2(item):  
     item.increment()  
     return item  
   
 #actual list rdd to custom rdd conversion  
 rdd1 = rdd.map(func)  
 print(rdd1.count())  
   
 #increment the data field of each item of rdd1 by calling the member function increment  
 rdd1 = rdd1.map(func2)  
 for i in rdd1.collect():  
     i.printData()  
To execute the code, ship the module along with the job: spark-submit --py-files mod.py followed by the name of the driver script.
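For reference, what the two map passes compute can be mimicked with plain Python lists, no cluster required. This sketch inlines the Data class and uses a hypothetical input list [1, 2, 3]:

```python
# Inlined copy of the class normally kept in mod.py.
class Data(object):
    def __init__(self, data):
        self.data = data

    def increment(self):
        self.data += 1

# rdd.map(func): wrap each integer in a Data object.
rdd1 = [Data(item) for item in [1, 2, 3]]

# rdd1.map(func2): call increment on every element.
for item in rdd1:
    item.increment()

print([item.data for item in rdd1])  # [2, 3, 4]
```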
