Thursday, February 18, 2016

Finding the most frequent word using Apache Spark

Hello all. This code finds the most frequent word in a corpus of text files using Apache Spark. I am assuming that readers are aware of basic RDD concepts. An RDD provides the transformations and actions for writing the logic, and Spark also provides APIs for creating RDDs from various data sources (HDFS, MongoDB, etc.).

The flow of the algorithm is as follows.
1. Load the data from HDFS.
  
 from pyspark import SparkContext  
 sc=SparkContext()  
 rddD=sc.textFile("hdfs://localhost:54310/data/hadoop/spark-input/wc.txt")   

2. Calculate the total number of occurrences of every word using transformations.

 flM=rddD.flatMap(lambda x: x.split()).map(lambda x: (x.lower(),1)).reduceByKey(lambda x,y:x+y)  
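
To see what step 2 computes without a cluster, here is a minimal plain-Python sketch of the same three stages on an in-memory list. The helper name word_counts and the sample lines are made up for illustration; Spark performs the same work, only distributed over partitions.

```python
from collections import defaultdict

def word_counts(lines):
    """Local stand-in for flatMap -> map -> reduceByKey on a list of lines."""
    # flatMap: split every line on whitespace and flatten into one word list
    words = [w for line in lines for w in line.split()]
    # map: pair each lower-cased word with the count 1
    pairs = [(w.lower(), 1) for w in words]
    # reduceByKey: add up the counts that share the same key
    totals = defaultdict(int)
    for word, one in pairs:
        totals[word] += one
    return dict(totals)

# hypothetical sample input, not taken from wc.txt
sample = ["Spark is fast", "spark is simple"]
counts = word_counts(sample)
print(counts)
```

Lower-casing happens before counting, so "Spark" and "spark" end up under the same key.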

3. Subtract the stop words from the RDD.
A normal text corpus has a very high proportion of stop words, so without this step the most frequent word would almost always be one of them. To eliminate them, we create an RDD of stop words and subtract it from flM, the RDD of word counts. The list of stop words can be maintained as a file in HDFS, and we can load this file directly to create the respective RDD.
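 # textFile already yields one element per line, i.e. one stop word each
 rddS=sc.textFile("hdfs://localhost:54310/data/hadoop/spark-input/stopwords_en.txt")
 # key the stop words the same way as the words in flM (trimmed, lower case)
 flS=rddS.map(lambda x: (x.strip().lower(),1))
 # drop every (word, count) pair whose word is a stop word
 flM=flM.subtractByKey(flS)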
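
The effect of subtractByKey can be sketched locally as well. This is only an illustration in plain Python, assuming small in-memory lists; subtract_by_key and the sample pairs are hypothetical names and values, not part of the Spark API or of the real data.

```python
def subtract_by_key(pairs, other_pairs):
    """Local stand-in for subtractByKey: keep pairs whose key is not in other_pairs."""
    other_keys = {key for key, _ in other_pairs}
    return [(key, value) for key, value in pairs if key not in other_keys]

# hypothetical word counts and stop words
word_freq = [("the", 40), ("spark", 7), ("is", 25), ("rdd", 5)]
stop_words = [("the", 1), ("is", 1), ("a", 1)]
remaining = subtract_by_key(word_freq, stop_words)
print(remaining)
```

Only the keys of the second collection matter; the value 1 attached to each stop word is never looked at.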
4. Find the most frequent word. This can be achieved using the reduce action on the RDD. The logic is to compare the (word, count) pairs using their frequency as the criterion of comparison.
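 def comp(x,y):
     # x and y are (word, count) pairs; keep the one with the larger count
     if x[1]<y[1]:
         return y
     else:
         return x
 maxOcc=flM.reduce(comp)

maxOcc will hold the most frequent word together with its number of occurrences, as a (word, count) pair.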
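
The reduce step can be tried out locally with functools.reduce, which applies a two-argument function pairwise in the same spirit as the RDD action. This is a minimal sketch on a made-up list of (word, count) pairs, not output from the real corpus.

```python
from functools import reduce

def comp(x, y):
    """Return whichever (word, count) pair has the larger count."""
    return y if x[1] < y[1] else x

# hypothetical counts left over after removing stop words
remaining = [("spark", 7), ("rdd", 5), ("hadoop", 9)]
max_occ = reduce(comp, remaining)
print(max_occ)

# the built-in max with a key function picks the same pair for this input
same_as_max = max_occ == max(remaining, key=lambda pair: pair[1])
```

When two words share the highest count, which of them comes back from a distributed reduce can depend on how the data is partitioned.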

You can also get this source code at https://github.com/shaileshcheke/spark-examples.git.
     

Friday, February 12, 2016

MPI clustering using a BeagleBone and a host machine


This page explains how to install a 2-node MPI cluster, where one node is my laptop with Fedora 19 and a C2D processor and the other node is a BeagleBone Black with Debian Linux and an ARM processor (Open MPI supports heterogeneous nodes). I am using openmpi-1.6.5 for the cluster.
1. Plug the BeagleBone into the host machine using a USB cable. This configures passwordless ssh between the root user of the host and that of the BeagleBone automatically. You can check it by executing the command ssh 192.168.7.2, where 192.168.7.2 is the IP address assigned to the USB interface of the BeagleBone. It can be changed by modifying the /etc/network/interfaces file. On my BeagleBone the usb0 interface is configured as given below.

iface usb0 inet static
    address 192.168.7.2
    netmask 255.255.255.0
    network 192.168.7.0
    gateway 192.168.7.1

You can change these settings to use the IP address that you want.

Once it is plugged in, the host machine will have 192.168.7.1 as the IP address for its end of the USB connection.
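
If you change the addresses, both ends of the USB link still have to sit in the same subnet. A small sketch with Python's standard ipaddress module can check that; the values below are the ones from the usb0 configuration, and the variable names are only illustrative.

```python
import ipaddress

# address and netmask of usb0 on the BeagleBone
board_if = ipaddress.ip_interface("192.168.7.2/255.255.255.0")
# address that the host gets on its end of the USB connection
host_addr = ipaddress.ip_address("192.168.7.1")

usb_net = board_if.network              # network derived from address + netmask
same_subnet = host_addr in usb_net      # True when both ends can reach each other directly
print(usb_net, same_subnet)
```

The derived network should match the network line of the interface configuration.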

4. On the master machine, open a terminal.
$su
$mkdir /opt/mpi

Open a new terminal.
$su
$ssh 192.168.7.2

You will land in the BeagleBone terminal, logged in as the root user.

$mkdir /opt/mpi

5. Download openmpi-1.6.5.tar.gz on the master machine.

6. On the master machine, open a terminal.
$su
$scp {path to}/openmpi-1.6.5.tar.gz root@192.168.7.2:/root/
This command copies the Open MPI tar file to the BeagleBone.
7. On the master machine, extract the tar file, then configure and install it.
$tar -xvf openmpi-1.6.5.tar.gz
$cd openmpi-1.6.5
$./configure --prefix=/opt/mpi --exec-prefix=/opt/mpi
$make install
On the BeagleBone, follow the same steps with the copy in /root.
8. The installation is now complete. We have to set the environment variables so that the shell can find the new binaries and libraries.
On the master machine, open a terminal.
$su
$vi ~/.bashrc
Write the following lines inside it:
export PATH=$PATH:/opt/mpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mpi/lib
Now log in to the BeagleBone
$ssh 192.168.7.2
and repeat the same steps there.
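
A quick way to confirm that the exports took effect is to look for the /opt/mpi directories among the colon-separated entries of the variables. The following is a rough sketch; has_entry is a hypothetical helper and the example values are made up to resemble the result of the two export lines.

```python
import os

def has_entry(path_value, directory):
    """True if directory is one of the colon-separated entries in path_value."""
    entries = [p.rstrip("/") for p in path_value.split(":") if p]
    return directory.rstrip("/") in entries

# hypothetical values, shaped like the result of the two export lines
example_path = "/usr/local/bin:/usr/bin:/opt/mpi/bin"
example_ld = ":/opt/mpi/lib"   # LD_LIBRARY_PATH was empty before the export
path_ok = has_entry(example_path, "/opt/mpi/bin")
ld_ok = has_entry(example_ld, "/opt/mpi/lib")
print(path_ok, ld_ok)

# the same check against the real environment of the current shell
print(has_entry(os.environ.get("PATH", ""), "/opt/mpi/bin"))
```

Run it in a fresh terminal on each node, since ~/.bashrc is only read when a new shell starts.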

9. We are done with the base work. The next few steps are what make the BeagleBone-host cluster different from a normal Beowulf cluster.
In my case the host machine is a Fedora system. Fedora requires an iptables rule before it accepts incoming connections from the BeagleBone. Use the following steps to configure iptables for incoming traffic on Fedora (the host machine).

$su
$iptables -I INPUT -s 192.168.7.2 -p tcp -j ACCEPT
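
Whether the rule really lets TCP traffic through can be probed with a few lines of Python. This is only a sketch: can_connect is a hypothetical helper, and the demo below talks to a throwaway listener on the loopback interface so that it runs anywhere. It assumes Python is available on the node you run it from.

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, filtered (timed out) or unreachable
        return False

# Self-contained demo: open a listener, probe it, close it, probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
reachable = can_connect("127.0.0.1", port)
listener.close()
refused = can_connect("127.0.0.1", port)
print(reachable, refused)
```

To test the real link, run can_connect("192.168.7.1", port) on the BeagleBone against a port on which something on the host is listening.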

10. Now let's check whether the compute cluster is configured. In the following code, every MPI process that is spawned prints its host name and rank.

/* Hello_c.c */
#include <stdio.h>
#include <unistd.h>   /* gethostname */
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size;
    char hostname[256];   /* large enough for typical host names */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    gethostname(hostname, sizeof(hostname));
    hostname[sizeof(hostname) - 1] = '\0';  /* a truncated name may lack the terminator */
    printf("Hello, world, I am %d of %d my host name is %s\n", rank, size, hostname);
    MPI_Finalize();
    return 0;
}

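To compile the code, execute

$mpicc Hello_c.c

This produces a.out. As the two nodes have different processor architectures, compile the file on each node so that both have their own a.out at the same path.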


11. Now create a text file, let's say "hostaddress", and add the IP addresses of the host and the BeagleBone, one IP address per line.
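
The file is simple enough to type by hand, but as a sketch, this is how its content could be generated. render_hostfile is a hypothetical helper; the two addresses are the ones used throughout this post.

```python
def render_hostfile(addresses):
    """Build the hostfile text: one IP address per line."""
    return "".join(address + "\n" for address in addresses)

# host machine first, then the BeagleBone
hostfile_text = render_hostfile(["192.168.7.1", "192.168.7.2"])
print(hostfile_text, end="")

# to create the file in the current directory:
# with open("hostaddress", "w") as handle:
#     handle.write(hostfile_text)
```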

12. When you submit the job, the Open MPI runtime environment goes through all the interfaces that are up in order to locate the machines mentioned in hostaddress. Sometimes this approach fails to locate the machine. To avoid that, we have to configure MPI to use a specified interface. To do it, just execute
$export HYDRA_IFACE=enp0s29f7u3
where enp0s29f7u3 is the name of the USB interface with IP address 192.168.7.1 on the host side. Note that HYDRA_IFACE is the variable read by MPICH's Hydra launcher; in Open MPI the interface is normally selected through the btl_tcp_if_include MCA parameter.

13. We are done with the configuration; let's submit the job to the cluster.

$mpirun -machinefile hostaddress -np 10 ./a.out

It should work: each of the 10 processes prints its rank and the name of the host it runs on.
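
To see at a glance how the processes were spread over the two nodes, the printed lines can be counted per host name. The following is a minimal sketch; the sample text and the host names "laptop" and "beaglebone" are made up to match the printf format of Hello_c.c and are not real output from the cluster.

```python
import re
from collections import Counter

# matches the line format printed by Hello_c.c
LINE = re.compile(r"Hello, world, I am (\d+) of (\d+) my host name is (\S+)")

def ranks_per_host(output):
    """Count how many processes reported each host name."""
    hosts = Counter()
    for match in LINE.finditer(output):
        hosts[match.group(3)] += 1
    return dict(hosts)

# hypothetical output of a 4-process run
sample_output = (
    "Hello, world, I am 0 of 4 my host name is laptop\n"
    "Hello, world, I am 1 of 4 my host name is beaglebone\n"
    "Hello, world, I am 2 of 4 my host name is laptop\n"
    "Hello, world, I am 3 of 4 my host name is beaglebone\n"
)
per_host = ranks_per_host(sample_output)
print(per_host)
```

If only one host name shows up, the job ran entirely on one node and the hostaddress file or the interface setting needs another look.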