Running a Spark Application on a YARN Cluster

Spark can run on several cluster managers: Apache Mesos, Hadoop YARN, or its own standalone cluster manager.

In this instructional blog post, we will run Spark on YARN. We will develop a Spark application and run it using the YARN cluster manager.

  • Refer to the following Spark-Java word count program.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
import java.util.Arrays;
public class WordCount {
  private static final FlatMapFunction<String, String> WORDS_EXTRACTOR =
      new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String s) throws Exception {
          return Arrays.asList(s.split(" "));
        }
      };
  private static final PairFunction<String, String, Integer> WORDS_MAPPER =
      new PairFunction<String, String, Integer>() {
        @Override
        public Tuple2<String, Integer> call(String s) throws Exception {
          return new Tuple2<String, Integer>(s, 1);
        }
      };
  private static final Function2<Integer, Integer, Integer> WORDS_REDUCER =
      new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer a, Integer b) throws Exception {
          return a + b;
        }
      };
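  public static void main(String[] args) {
    // Both the input path (args[0]) and the output path (args[1]) are required.
    if (args.length < 2) {
      System.err.println("Please provide the input file path and the output directory path as arguments");
      System.exit(1);
    }
    SparkConf conf = new SparkConf().setAppName("org.sparkexample.WordCount");
    JavaSparkContext context = new JavaSparkContext(conf);
    // Read the input file as an RDD of lines.
    JavaRDD<String> file = context.textFile(args[0]);
    // Split each line into words.
    JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);
    // Turn each word into a (word, 1) pair.
    JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
    // Sum the counts of identical words.
    JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);
    // Write the (word, count) pairs to the output directory.
    counter.saveAsTextFile(args[1]);
    // Release the resources held by the Spark context.
    context.stop();
  }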
}
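To see what the three stages of the program compute before involving a cluster, the same logic can be sketched in plain Java on an in-memory list. This is only an illustration under our own assumptions: the class LocalWordCountSketch and its methods are hypothetical names, not part of Spark or of the program above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the original program: it mimics
// the three Spark stages on an in-memory list so no cluster is needed.
class LocalWordCountSketch {

  // Stage 1 (like flatMap with WORDS_EXTRACTOR): split every line on a
  // single space and flatten the pieces into one list of words.
  static List<String> extractWords(List<String> lines) {
    List<String> words = new ArrayList<String>();
    for (String line : lines) {
      words.addAll(Arrays.asList(line.split(" ")));
    }
    return words;
  }

  // Stages 2 and 3 (like mapToPair followed by reduceByKey): treat each
  // word as a (word, 1) pair and add up the ones that share the same key.
  static Map<String, Integer> countWords(List<String> words) {
    // A TreeMap keeps the keys sorted, which makes the printed result stable.
    Map<String, Integer> counts = new TreeMap<String, Integer>();
    for (String word : words) {
      Integer current = counts.get(word);
      counts.put(word, current == null ? 1 : current + 1);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("to be or", "not to be");
    // prints {be=2, not=1, or=1, to=2}
    System.out.println(countWords(extractWords(lines)));
  }
}
```

Because the split is on a single space, two consecutive spaces in a line produce an empty string as a "word". The Spark program above behaves the same way, since it uses the same split call.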
  • Before building a JAR file for this program, set up the project in Eclipse by following the steps below.
    • Create a Java project in Eclipse, create a class named WordCount in it, and paste the whole program into that class.
    • Right-click the project -> Build Path -> Configure Build Path.
    • Add the Spark assembly 1.5.1 JAR, which is in the lib directory of your Spark folder.
    • After adding the JAR file, the compilation errors will be cleared.
  • Now we need to make a JAR file of that project to run on the cluster.
    • Making a JAR for Spark is a little different from Hadoop: you need to install Maven and build your JAR file with Maven.
  • Follow these steps to install Maven on your system.
    • Open the terminal and type the following:
      yum install maven

      The download and installation will take some time to complete.

    • We can check whether Maven is installed by running the following command:
      mvn
    • Check the version of the installed Maven by using the command:
      mvn -version

You have now successfully installed Maven on your system.

  • Now we have to convert our project into a Maven project so that it can be packaged. Follow these steps:

Right-click the project -> Configure -> Convert to Maven Project

After you choose this option from the Configure drop-down menu, your project will be converted into a Maven project.

A pom.xml file will be created in your project.

  • Open the pom.xml file, move to the Dependencies tab, and click Add to add the dependencies to your project.

  • Give the dependencies of the project as follows:

Group Id -> org.apache.spark

Artifact Id -> spark-core_2.10

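Version -> 1.5.1 (use the version that matches your Spark installation; this walkthrough uses the Spark assembly 1.5.1 JAR)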

  • Click OK.

Maven will download the required dependencies from the internet and configure them in your project automatically.

  • After following all the steps, you can check whether the dependencies have been downloaded by looking under Maven Dependencies in your project.

With this, the dependency setup required for building the JAR is complete. The JAR can now be created.

To create the JAR, follow the steps below:

  • Open a terminal and move to the directory where your project is located.
  • To find the path of your project:

Right-click the project -> Properties

  • After you click Properties, a Properties tab opens beside the Console tab.
    • The Properties tab shows the project location. Copy that path.

  • In the terminal, type the following:

cd <path>

This takes you to the folder where your project is saved.

  • Create the package by typing the following command:
mvn package

It will take some time to build the package. When the build finishes, you can find the built JAR inside the target directory.

Spak_Java_WordCount-0.0.1-SNAPSHOT.jar will be created. This is your JAR file.

The remaining work is to deploy the built JAR on the cluster.

  • To deploy the project on the cluster, you need to specify a few things.
    • Open your Spark installation folder and type the following:
./bin/spark-submit --class org.sparkexample.WordCount --master yarn-cluster /<path to maven project>/target/spark-examples-1.0-SNAPSHOT.jar /<path to a demo test file> /<path to output directory>
./bin --> the bin folder inside the Spark installation folder
spark-submit --> the script that submits the application. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster, and the input and output of the application are attached to the console. In yarn-cluster mode, which is used here, the driver runs inside an application master process managed by YARN on the cluster.
--class --> the option that specifies the main class of the application
org.sparkexample.WordCount --> name of the main class (org.sparkexample is the package that contains WordCount)
--master yarn-cluster --> runs the Spark program on the YARN cluster
/<path to maven project> --> path to the project that contains the built JAR file
/<path to a demo test file> --> path to your input file (as the program is running on YARN, the file needs to be in HDFS)
/<path to output directory> --> path in HDFS where the output should be created

In this case, the syntax will look like this:

./bin/spark-submit --class WordCount --master yarn-cluster /home/kiran/workspace/Spark_Java_WordCount/target/Spak_Java_WordCount-0.0.1-SNAPSHOT.jar /input /Spark_on_yarn_wc_output

*Note: After --class, only WordCount (the name of my main class) has been specified, because the class was created in the default package (I did not create any package). Hence, it was not required to type the org.sparkexample package prefix.*

If you created the class inside a package, you need to specify the package name as well.
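The order of the spark-submit arguments matters: options such as --class and --master come first, then the application JAR, then the arguments passed on to the program. A minimal sketch of that ordering in plain Java follows. The class SparkSubmitCommandSketch and its buildCommand method are hypothetical names of our own, not part of Spark. The sketch only assembles and prints the command; it does not run it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: it only assembles the argument
// list in the order spark-submit expects (options, then JAR, then app args).
class SparkSubmitCommandSketch {

  static List<String> buildCommand(String mainClass, String master,
      String jarPath, String inputPath, String outputPath) {
    List<String> command = new ArrayList<String>();
    command.add("./bin/spark-submit");
    command.add("--class");
    command.add(mainClass);   // fully qualified name; no prefix for the default package
    command.add("--master");
    command.add(master);
    command.add(jarPath);     // application JAR built by mvn package
    command.add(inputPath);   // becomes args[0] in WordCount.main
    command.add(outputPath);  // becomes args[1] in WordCount.main
    return command;
  }

  public static void main(String[] args) {
    List<String> command = buildCommand("WordCount", "yarn-cluster",
        "/home/kiran/workspace/Spark_Java_WordCount/target/Spak_Java_WordCount-0.0.1-SNAPSHOT.jar",
        "/input", "/Spark_on_yarn_wc_output");
    // Print the command as a single line; it is not executed here.
    System.out.println(String.join(" ", command));
  }
}
```

With the values used in this post, the printed line is the same command as the one shown above. For a class inside a package, the first argument would be the fully qualified name, for example org.sparkexample.WordCount.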

When you run the program with the command mentioned above, the console output will be as follows:

You will see final status: SUCCEEDED

  • After this, we can check for the output files inside HDFS in the output directory. Below is the screenshot of the input file and the output files inside HDFS.

  • In the above screenshot, one can see that part files have been created in the output directory Spark_on_yarn_wc_output.
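Each line in a part file is the text form of one (word, count) pair, because saveAsTextFile writes the string representation of every element and a Tuple2 prints as (first,second). If you want to process those lines further, a parser could look like the following sketch. The class PartFileLineParserSketch is a hypothetical helper of our own, and it assumes every line has exactly the form (word,count).

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical helper, not part of the original post: parses one line of
// a part file, assuming the "(word,count)" text form of a Tuple2.
class PartFileLineParserSketch {

  static Map.Entry<String, Integer> parseLine(String line) {
    String trimmed = line.trim();
    if (trimmed.length() < 2 || !trimmed.startsWith("(") || !trimmed.endsWith(")")) {
      throw new IllegalArgumentException("Not a (word,count) line: " + line);
    }
    // Drop the surrounding parentheses.
    String body = trimmed.substring(1, trimmed.length() - 1);
    // Split on the LAST comma, because the word itself may contain commas.
    int comma = body.lastIndexOf(',');
    if (comma < 0) {
      throw new IllegalArgumentException("Missing comma in line: " + line);
    }
    String word = body.substring(0, comma);
    int count = Integer.parseInt(body.substring(comma + 1));
    return new AbstractMap.SimpleEntry<String, Integer>(word, count);
  }

  public static void main(String[] args) {
    Map.Entry<String, Integer> entry = parseLine("(hello,3)");
    // prints hello -> 3
    System.out.println(entry.getKey() + " -> " + entry.getValue());
  }
}
```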

You can also check the status of the Spark application in the ResourceManager web UI. To do so, open the ResourceManager UI at the following address:

localhost:8088

In the above screenshot, you can see that the Spark application has run successfully on the YARN cluster manager.
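The same information is also available without a browser, through the ResourceManager REST API. The following is a minimal sketch that assumes the ResourceManager listens on localhost:8088, as above, and that the Cluster Applications endpoint (/ws/v1/cluster/apps) is enabled on your Hadoop version. The class YarnAppStatusSketch is a hypothetical name of our own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of the original post: asks the
// ResourceManager REST API for the list of Spark applications.
class YarnAppStatusSketch {

  // Builds the Cluster Applications URL, filtered to applications of type SPARK.
  static String appsUrl(String resourceManagerAddress) {
    return "http://" + resourceManagerAddress + "/ws/v1/cluster/apps?applicationTypes=SPARK";
  }

  // Performs a GET request and returns the response body as text.
  static String fetch(String url) throws IOException {
    HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
    connection.setRequestMethod("GET");
    connection.setRequestProperty("Accept", "application/json");
    StringBuilder body = new StringBuilder();
    try {
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(connection.getInputStream(), "UTF-8"));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          body.append(line).append('\n');
        }
      } finally {
        reader.close();
      }
    } finally {
      connection.disconnect();
    }
    return body.toString();
  }

  public static void main(String[] args) {
    try {
      // Assumption: the ResourceManager is reachable at localhost:8088.
      System.out.println(fetch(appsUrl("localhost:8088")));
    } catch (IOException e) {
      System.err.println("Could not reach the ResourceManager: " + e.getMessage());
    }
  }
}
```

The JSON response lists the applications known to the ResourceManager, including a final status field for each one, so a finished word count run should appear there with the same SUCCEEDED status that the console reports.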

We hope this blog helped you understand how to run a Spark application on a YARN cluster. Keep visiting our site www.acadgild.com for more updates on big data and other technologies.