Saturday, July 18, 2015

Custom RDD in Apache Spark using Python (using custom classes to create an RDD).

An RDD (Resilient Distributed Dataset) is the core abstraction Apache Spark provides for distributed operations on data. This post shows how to write a custom RDD in Python (PySpark), i.e. an RDD whose elements are instances of your own class.
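Before getting to custom classes, here is a minimal sketch of the usual PySpark workflow with built-in types (assuming a local Spark installation; the variable names are illustrative):

from pyspark import SparkContext

sc = SparkContext()
# A plain RDD of integers: map() runs in parallel across partitions,
# and collect() brings the results back to the driver.
nums = sc.parallelize([1, 2, 3, 4])
print nums.map(lambda x: x + 1).collect()  # [2, 3, 4, 5]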

The following code shows a simple Python class with a single data member. Save it in its own module, mod.py; the reason is explained below.

# mod.py file.
class Data(object):
    def __init__(self, data):
        self.data = data

    def increment(self):
        self.data += 1

    def printData(self):
        print self.data
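As a quick sanity check (nothing Spark-specific yet), the class can be exercised locally:

from mod import Data

d = Data(1)
d.increment()
d.printData()  # prints 2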
Currently, Apache Spark cannot pickle a class that is defined in the driver script itself, which is why the class definition must live in a separate module (here, mod.py). The following listing shows how to use the class above to create a custom RDD.

# code.py file.
from pyspark import SparkContext
from mod import Data

sc = SparkContext()

# Create an RDD from a list of integers.
rdd = sc.parallelize([1, 2, 3, 4])

# Wrap an integer in a Data object.
def func(item):
    return Data(item)

# Call increment() on an element and return the updated object.
def func2(item):
    item.increment()
    return item

# Convert the plain list RDD to a custom RDD of Data objects.
rdd1 = rdd.map(func)
print rdd1.count()

# Increment the data field of each item. RDDs are immutable and foreach()
# runs on the executors, so its side effects are not visible on the driver;
# map() instead produces a new RDD carrying the incremented objects.
rdd2 = rdd1.map(func2)
for i in rdd2.collect():
    i.printData()
To execute the code, ship mod.py to the workers with the --py-files flag:

spark-submit --py-files mod.py code.py
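Setting Spark's own log output aside, the run should print the count of the RDD followed by the incremented values:

4
2
3
4
5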
