Thursday, February 18, 2016

Finding the most frequent word using Apache Spark

Hello all, this post shows code for finding the most frequent word in a corpus of text files using Apache Spark. I am assuming that readers are familiar with basic RDD concepts. An RDD provides transformations and actions for writing the logic, and Spark also provides APIs for creating RDDs from various data sources (HDFS, MongoDB, etc.).

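As a quick refresher, here is a minimal sketch of the transformation/action pattern on a small in-memory RDD (assuming a SparkContext named sc, created as in step 1 below; the numbers are just an illustrative example):
 nums = sc.parallelize([1, 2, 3, 4, 5])
 squares = nums.map(lambda x: x * x)         # transformation (lazy)
 print(squares.reduce(lambda x, y: x + y))   # action, triggers computation and prints 55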
The flow of the algorithm is as follows.
1. Load the data from HDFS.
  
 from pyspark import SparkContext  
 sc=SparkContext()  
 rddD=sc.textFile("hdfs://localhost:54310/data/hadoop/spark-input/wc.txt")   
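textFile returns an RDD with one element per line of the file. A quick way to check that the load worked (assuming the HDFS path above is reachable) is:
 print(rddD.count())   # number of lines in wc.txt
 print(rddD.take(2))   # first two lines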

2. Calculate the total occurrences of every word using transformations.

 flM=rddD.flatMap(lambda x: x.split()).map(lambda x: (x.lower(),1)).reduceByKey(lambda x,y:x+y)  
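As a quick sanity check, the same chain of transformations can be run on a small in-memory RDD; the two sample sentences below are made up just for illustration.
 sample = sc.parallelize(["Spark makes word count easy", "word count with Spark"])
 counts = sample.flatMap(lambda x: x.split()).map(lambda x: (x.lower(), 1)).reduceByKey(lambda x, y: x + y)
 print(counts.collect())
 # e.g. [('spark', 2), ('word', 2), ('count', 2), ('makes', 1), ('easy', 1), ('with', 1)] (order may vary)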

3. Subtract the stop words from the RDD.
 A normal text corpus has a very high proportion of stop words, so we have to eliminate them. To achieve this, we create an RDD of stop words and subtract it from rddD. The list of stop words can be maintained as a file in HDFS; we can load this file to create the corresponding RDD directly.
 rddS=sc.textFile("hdfs://localhost:54310/data/hadoop/spark-input/stopwords_en.txt")  
 flS=rddS.flatMap(lambda x: x.split("\n")).map(lambda x: (x,1)).reduceByKey(lambda x,y:x+y)  
 flM=flM.subtractByKey(flS)  
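To see what subtractByKey does in isolation, here is a tiny illustrative example with made-up counts; only the keys matter for the subtraction.
 words = sc.parallelize([("the", 42), ("spark", 7), ("and", 30)])
 stops = sc.parallelize([("the", 1), ("and", 1)])
 print(words.subtractByKey(stops).collect())   # [('spark', 7)]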
4. Find the most frequent word. This can be done with the reduce action on the RDD; the logic is to compare (word, count) pairs, using the count as the comparison criterion.
 def comp(x, y):
     # keep the (word, count) pair with the larger count
     if x[1] < y[1]:
         return y
     else:
         return x
 maxOcc = flM.reduce(comp)
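As a side note, the same result can also be obtained with the max action and a key function; this is just an equivalent alternative to the reduce above.
 maxOcc = flM.max(key=lambda kv: kv[1])   # pair with the highest count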

maxOcc will hold the most frequent word together with its count.

The full source code is also available at https://github.com/shaileshcheke/spark-examples.git.
     
