Big Data Hadoop & Spark - Advanced

Building Inverted Index in MapReduce

An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, a document, or a set of documents. Most text search systems rely on an inverted index to find the documents that contain a given word or term.
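Before moving to MapReduce, it helps to see the data structure itself. The sketch below is a plain, in-memory illustration, independent of Hadoop; the sample documents and the `index` method name are invented for this example. It models the inverted index as a map from each word to the documents containing it, with a per-document count:

```java
import java.util.HashMap;
import java.util.Map;

public class SimpleInvertedIndex {
    // word -> (document name -> count of the word in that document)
    static Map<String, Map<String, Integer>> index(Map<String, String> docs) {
        Map<String, Map<String, Integer>> idx = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                if (word.isEmpty()) continue;
                // Create the inner map on first sight of the word,
                // then bump the count for this document
                idx.computeIfAbsent(word, w -> new HashMap<>())
                   .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("file1", "hadoop is a framework hadoop");
        docs.put("file2", "spark is fast");
        Map<String, Map<String, Integer>> idx = index(docs);
        // "hadoop" appears twice in file1; "is" appears once in each file
        System.out.println("hadoop -> " + idx.get("hadoop"));
        System.out.println("is -> " + idx.get("is"));
    }
}
```

The MapReduce program in this post builds exactly this structure, except that the work is split across a Mapper and a Reducer so it scales to files stored in HDFS.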

In this post, we will implement a MapReduce application that builds an inverted index: for every word appearing in the input files, it produces the set of files that contain that word and the word's frequency in each of those files.

The prerequisites to build the inverted index are as follows:

  • Hadoop should be up and running.

  • HDFS must contain a directory with one or more text files (the content of the files can be any text data).

Now, let’s analyze the steps required to build the inverted index.

As input to the program, we will get a list of files with text data in them.

File1 → text data

File2 → text data

Output of the program should be as shown below:

Word1 [file1 → count of word1 in file1, file2 → count of word1 in file2, …]

Word2 [file1 → count of word2 in file1, file2 → count of word2 in file2, …]

For executing this program, we are taking two sample text files with arbitrary text content as input.

Code for Mapper:

/* This program uses TextInputFormat, which emits the byte offset as the key and the entire line as the value to the Mapper. We need to split each line and get the individual words from it.

** As the output of the Mapper, we emit the word as the key and the name of the file in which it is present as the value. */

public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        /* Get the name of the file using context.getInputSplit().
           Note: FileSplit must be imported from
           org.apache.hadoop.mapreduce.lib.input.FileSplit (the new API),
           not org.apache.hadoop.mapred.FileSplit, or the cast will fail. */
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String line = value.toString();
        // Split the line into words on whitespace
        String[] words = line.split("\\s+");
        for (String s : words) {
            // For each word, emit the word as key and the file name as value
            if (!s.isEmpty()) {
                context.write(new Text(s), new Text(fileName));
            }
        }
    }
}


Reducer Code:

/* The Reducer receives a word as the key and the list of file names in which the word is present as the values.

** If the word occurred N times in a file, then that file name will be present N times in the list of values from the Mapper. */

public static class Reduce extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        /* Declare a HashMap that stores the file name as the key and the
           number of times that file name has occurred as the value */
        HashMap<String, Integer> m = new HashMap<String, Integer>();
        for (Text t : values) {
            String str = t.toString();
            if (m.containsKey(str)) {
                /* The file name is already present in the HashMap, so just
                   increment the count stored for that file name */
                m.put(str, m.get(str) + 1);
            } else {
                /* First occurrence of the word in this file: add the file
                   name to the HashMap with a count of 1 */
                m.put(str, 1);
            }
        }
        /* Emit the word and [file1=count of the word in file1,
           file2=count of the word in file2, …] as output */
        context.write(key, new Text(m.toString()));
    }
}
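The counting loop in the Reducer can be exercised outside Hadoop. The sketch below (the class name and the sample value list are invented for this example) shows how a value list such as [file1, file1, file2] for one word collapses into per-file counts:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

public class ReduceLogicDemo {
    // Same counting logic as the reducer: count occurrences of each file name
    static HashMap<String, Integer> countFiles(List<String> values) {
        HashMap<String, Integer> m = new HashMap<String, Integer>();
        for (String str : values) {
            if (m.containsKey(str)) {
                m.put(str, m.get(str) + 1); // file name seen before: increment
            } else {
                m.put(str, 1);              // first occurrence of this file name
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // The Mapper emitted "hadoop" twice for file1 and once for file2
        List<String> values = Arrays.asList("file1", "file1", "file2");
        System.out.println("hadoop\t" + countFiles(values));
    }
}
```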

Driver Program:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "UseCase1");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    // Defining the output key and value classes for the mapper
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    // Defining the output key and value classes for the job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    Path outputPath = new Path(args[1]);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, outputPath);
    // Delete the output path from HDFS if it already exists, so that we
    // don't have to delete it explicitly before each run
    outputPath.getFileSystem(conf).delete(outputPath, true);
    // Exit with status 0 if the job succeeds, 1 otherwise
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Steps to Execute

- Copy the two files into a folder and copy that folder into HDFS.

- Build the jar file for the above program.

- Run the jar file with the hadoop jar command. In our case the command is as below:

hadoop jar /home/kiran/jars/invert.jar /invert /invert_output

We have built the jar file with the name invert.jar, copied the two input files into the HDFS folder /invert, and specified /invert_output as the output directory.
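Because the Reducer emits m.toString(), each line of the output file is the word, a tab, and the HashMap's string form. The small sketch below, using invented counts, shows what one output line looks like (note that the entry order inside the braces is not guaranteed, since HashMap does not maintain insertion order):

```java
import java.util.HashMap;

public class OutputFormatDemo {
    public static void main(String[] args) {
        // Invented counts: the word appeared twice in file1, once in file2
        HashMap<String, Integer> m = new HashMap<String, Integer>();
        m.put("file1", 2);
        m.put("file2", 1);
        // One line of the job output: word<TAB>{file1=2, file2=1}
        // (exact entry order depends on HashMap iteration order)
        System.out.println("word\t" + m);
    }
}
```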

To check the output, list the contents of the output directory in HDFS and view the part file, for example with hadoop fs -cat /invert_output/part-r-00000.

Hope this post has been helpful in understanding how to build an inverted index in MapReduce. In case of any questions or doubts on this topic, feel free to comment below and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.
