
Running a Spark Application Locally in Windows


Apache Spark is a cluster computing framework that can run on top of Hadoop and handle many different types of data. It is a one-stop solution to many problems. Spark has rich resources for handling data and, most importantly, it is roughly 10-20x faster than Hadoop's MapReduce. It attains this speed through its in-memory primitives: data is cached in memory (RAM) and the computations are performed on it there.
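As a small illustration of that in-memory model (our own sketch; the file name is a placeholder), an RDD can be cached explicitly so that repeated actions reuse the in-memory copy instead of re-reading the source:

import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheExample").setMaster("local"))
    val data = sc.textFile("some-input.txt").cache() // keep the RDD in memory after first use
    println(data.count())                            // first action reads the file and caches it
    println(data.filter(_.nonEmpty).count())         // second action works on the cached data
    sc.stop()
  }
}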

Spark's rich set of components covers almost everything in the Hadoop ecosystem. For example, we can perform both batch processing and real-time data processing in Spark without additional Hadoop-ecosystem tools such as Kafka or Flume; it has its own streaming engine, called Spark Streaming.
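To make that concrete, here is a minimal Spark Streaming sketch (our own illustration, not part of the word count exercise below) that counts words arriving on a local socket in 10-second batches; the host and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))    // 10-second batch interval
    val lines = ssc.socketTextStream("localhost", 9999)  // e.g. fed by `nc -lk 9999`
    lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()                                           // print each batch's word counts
    ssc.start()
    ssc.awaitTermination()
  }
}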

We can perform various functions with Spark:

  • SQL operations: Spark has its own SQL engine, Spark SQL, which covers the features of both standard SQL and Hive (a small sketch follows this list).

  • Machine learning: Spark has its own machine learning library, MLlib, so it can perform machine learning without the help of Mahout.

  • Graph processing: Spark performs graph processing using the GraphX component.

All of the above features are built into Spark.
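As a small illustration of the first point, here is a hedged Spark SQL sketch (our own example, using the Spark 1.5-era SQLContext API; the data and table name are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SqlExample").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a tiny DataFrame from an in-memory collection and query it with SQL
    val people = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 26").show()

    sc.stop()
  }
}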

Spark can be run under different cluster managers, such as Hadoop YARN and Apache Mesos, and it ships with its own standalone scheduler to get started when no other cluster manager is available. Spark also provides easy access to data on many storage systems, for example HDFS, HBase, MongoDB and Cassandra, and it can store data on the local file system as well. You can learn more about the basics of Apache Spark from here.
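As a rough illustration of those cluster-manager choices (the host names and ports below are placeholders, not part of this tutorial's setup), setMaster() is where you select one:

import org.apache.spark.SparkConf

object MasterUrls {
  // Illustrative master URLs for different cluster managers
  val local      = new SparkConf().setAppName("app").setMaster("local[*]")          // all local cores
  val standalone = new SparkConf().setAppName("app").setMaster("spark://host:7077") // Spark standalone
  val mesos      = new SparkConf().setAppName("app").setMaster("mesos://host:5050") // Apache Mesos
  val yarn       = new SparkConf().setAppName("app").setMaster("yarn-client")       // Hadoop YARN (Spark 1.x syntax)
}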

Let us go through the step-by-step process of running a Spark application in Windows.

In this post, we will execute the traditional Spark Scala word count program in Windows using Eclipse.

  • For this, you need the extracted Spark tar file and the Scala IDE for Eclipse.

    • Here we have used the spark-1.5.2-bin-hadoop-2.6.0 version (you can use a later version as well).

    • Download the Spark tar file from here.

  • After downloading, extract the file.

    • You will see a spark-1.5.2-bin-hadoop-2.6.0 folder.

  • Now open the Scala IDE for Eclipse and create a Scala project as shown in the screenshot below.

For creating a Scala project, go to:

File–>New–>Other–>Scala Wizards–>Scala Project

    • Now you will be prompted to provide the project name, as shown in the screenshot below.

      • We have named the project "Spark_wc"; you can give the project any name of your choice.
    • Now create a Scala object. Inside the project, you can see a folder named src. Right-click on src–>New–>Scala Object as shown in the screenshot below.

       

    • Now you will be prompted to provide the object name as shown in the screenshot below.

      • We have given our object name as “Wordcount.”

         


  • Now click on Finish.
    • You can see that a Scala object has been created in the src folder.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

object Wordcount {
  def main(args: Array[String]) = {
    // Start the Spark context
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("local")
    val sc = new SparkContext(conf)

    // Read the example input file into an RDD
    val test = sc.textFile("inp") // provide the input path of the file here

    test.flatMap { line =>      // for each line
      line.split(" ")           // split the line into words
    }
      .map { word =>            // for each word
        (word, 1)               // return a key/value tuple, with the word as key and 1 as value
      }
      .reduceByKey(_ + _)       // sum all the values with the same key
      .saveAsTextFile("output") // save the result to a text file

    // Stop the Spark context
    sc.stop()
  }
}
  • Now you need to add the spark-assembly JAR file to import the Spark packages.

    • Right-click on src–>Build Path–>Configure Build Path–>Libraries–>Add External JARs–>Browse to the spark-1.5.2-bin-hadoop-2.6.0 folder.

    • Go to spark-1.5.2-bin-hadoop-2.6.0/lib/ and add the spark-assembly-1.5.2-hadoop-2.6.0.jar file.

      • Click on Apply as shown in the screenshot below.

 

*Note: If your project's Scala library is 2.11, change it to 2.10, because the pre-built Spark 1.5.2 assembly is compiled against Scala 2.10. Click on Add Library and select the stable 2.10.6 library.
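As an aside, if you prefer to manage dependencies with sbt rather than wiring the assembly jar into the Eclipse build path by hand, a minimal build.sbt along these lines (an alternative setup, not the one used in this walkthrough) would pull in the same Spark version:

// build.sbt — sbt alternative to the manual build-path setup above
name := "Spark_wc"

scalaVersion := "2.10.6"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"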

  • The Spark application can now be run in Windows.

    • We have already given the input and output paths of the files inside the program itself.

    • We have our input file in the project directory itself, so our input file path is given simply as inp.

 

  • We named our output path simply output, so the output files will be created in the project directory itself.

  • Now run the program. To do that, right-click on the main object–>Run As–>Scala Application.

  • Your Scala application will start running and you will be able to see the logs in your Eclipse console.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/16 01:18:12 INFO SparkContext: Running Spark version 1.5.2
16/09/16 01:18:13 INFO SecurityManager: Changing view acls to: Kirankrishna
16/09/16 01:18:13 INFO SecurityManager: Changing modify acls to: Kirankrishna
16/09/16 01:18:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Kirankrishna); users with modify permissions: Set(Kirankrishna)
16/09/16 01:18:13 INFO Slf4jLogger: Slf4jLogger started
16/09/16 01:18:14 INFO Remoting: Starting remoting
16/09/16 01:18:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.56.1:60792]
16/09/16 01:18:14 INFO Utils: Successfully started service 'sparkDriver' on port 60792.
16/09/16 01:18:14 INFO SparkEnv: Registering MapOutputTracker
16/09/16 01:18:14 INFO SparkEnv: Registering BlockManagerMaster
16/09/16 01:18:14 INFO DiskBlockManager: Created local directory at C:\Users\Kirankrishna\AppData\Local\Temp\blockmgr-521f10c1-691c-49b2-b1e7-0eda52532f6c
16/09/16 01:18:14 INFO MemoryStore: MemoryStore started with capacity 972.5 MB
16/09/16 01:18:14 INFO HttpFileServer: HTTP File server directory is C:\Users\Kirankrishna\AppData\Local\Temp\spark-9c99d1f4-03fc-409b-a1b1-e362339ce7f9\httpd-012ac95a-5296-4847-bc72-e8851409b31d
16/09/16 01:18:14 INFO HttpServer: Starting HTTP Server
16/09/16 01:18:14 INFO Utils: Successfully started service 'HTTP file server' on port 60793.
16/09/16 01:18:14 INFO SparkEnv: Registering OutputCommitCoordinator
16/09/16 01:18:14 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/09/16 01:18:14 INFO SparkUI: Started SparkUI at http://192.168.56.1:4040
16/09/16 01:18:14 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/09/16 01:18:14 INFO Executor: Starting executor ID driver on host localhost
16/09/16 01:18:14 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60800.
16/09/16 01:18:14 INFO NettyBlockTransferService: Server created on 60800
16/09/16 01:18:14 INFO BlockManagerMaster: Trying to register BlockManager
16/09/16 01:18:14 INFO BlockManagerMasterEndpoint: Registering block manager localhost:60800 with 972.5 MB RAM, BlockManagerId(driver, localhost, 60800)
16/09/16 01:18:14 INFO BlockManagerMaster: Registered BlockManager
16/09/16 01:18:15 INFO MemoryStore: ensureFreeSpace(130448) called with curMem=0, maxMem=1019782103
16/09/16 01:18:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 972.4 MB)
16/09/16 01:18:15 INFO MemoryStore: ensureFreeSpace(14276) called with curMem=130448, maxMem=1019782103
16/09/16 01:18:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 972.4 MB)
16/09/16 01:18:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60800 (size: 13.9 KB, free: 972.5 MB)
16/09/16 01:18:15 INFO SparkContext: Created broadcast 0 from textFile at Wordcount.scala:15
16/09/16 01:18:16 WARN : Your hostname, ACD-KIRAN resolves to a loopback/non-reachable address: fe80:0:0:0:0:5efe:c0a8:ae3%net5, but we couldn't find any external IP address!
16/09/16 01:18:17 INFO FileInputFormat: Total input paths to process : 1
16/09/16 01:18:17 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/09/16 01:18:17 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/09/16 01:18:17 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/09/16 01:18:17 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/09/16 01:18:17 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/09/16 01:18:17 INFO SparkContext: Starting job: saveAsTextFile at Wordcount.scala:24
16/09/16 01:18:17 INFO DAGScheduler: Registering RDD 3 (map at Wordcount.scala:20)
16/09/16 01:18:17 INFO DAGScheduler: Got job 0 (saveAsTextFile at Wordcount.scala:24) with 1 output partitions
16/09/16 01:18:17 INFO DAGScheduler: Final stage: ResultStage 1(saveAsTextFile at Wordcount.scala:24)
16/09/16 01:18:17 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/09/16 01:18:17 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/09/16 01:18:17 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at Wordcount.scala:20), which has no missing parents
16/09/16 01:18:17 INFO MemoryStore: ensureFreeSpace(4008) called with curMem=144724, maxMem=1019782103
16/09/16 01:18:17 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.9 KB, free 972.4 MB)
16/09/16 01:18:17 INFO MemoryStore: ensureFreeSpace(2281) called with curMem=148732, maxMem=1019782103
16/09/16 01:18:17 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 972.4 MB)
16/09/16 01:18:17 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60800 (size: 2.2 KB, free: 972.5 MB)
16/09/16 01:18:17 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
16/09/16 01:18:17 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at Wordcount.scala:20)
16/09/16 01:18:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/09/16 01:18:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2148 bytes)
16/09/16 01:18:17 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/09/16 01:18:17 INFO HadoopRDD: Input split: file:/C:/Users/Kirankrishna/workspace/Spark_wc/inp:0+84
16/09/16 01:18:17 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver
16/09/16 01:18:17 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 107 ms on localhost (1/1)
16/09/16 01:18:17 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/09/16 01:18:17 INFO DAGScheduler: ShuffleMapStage 0 (map at Wordcount.scala:20) finished in 0.118 s
16/09/16 01:18:17 INFO DAGScheduler: looking for newly runnable stages
16/09/16 01:18:17 INFO DAGScheduler: running: Set()
16/09/16 01:18:17 INFO DAGScheduler: waiting: Set(ResultStage 1)
16/09/16 01:18:17 INFO DAGScheduler: failed: Set()
16/09/16 01:18:17 INFO DAGScheduler: Missing parents for ResultStage 1: List()
16/09/16 01:18:17 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at Wordcount.scala:24), which is now runnable
16/09/16 01:18:18 INFO MemoryStore: ensureFreeSpace(127704) called with curMem=151013, maxMem=1019782103
16/09/16 01:18:18 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 124.7 KB, free 972.3 MB)
16/09/16 01:18:18 INFO MemoryStore: ensureFreeSpace(42757) called with curMem=278717, maxMem=1019782103
16/09/16 01:18:18 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 41.8 KB, free 972.2 MB)
16/09/16 01:18:18 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60800 (size: 41.8 KB, free: 972.5 MB)
16/09/16 01:18:18 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
16/09/16 01:18:18 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at Wordcount.scala:24)
16/09/16 01:18:18 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/09/16 01:18:18 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1901 bytes)
16/09/16 01:18:18 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/09/16 01:18:18 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/09/16 01:18:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms
16/09/16 01:18:18 INFO FileOutputCommitter: Saved output of task 'attempt_201609160118_0001_m_000000_1' to file:/C:/Users/Kirankrishna/workspace/Spark_wc/output/_temporary/0/task_201609160118_0001_m_000000
16/09/16 01:18:18 INFO SparkHadoopMapRedUtil: attempt_201609160118_0001_m_000000_1: Committed
16/09/16 01:18:18 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2080 bytes result sent to driver
16/09/16 01:18:18 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at Wordcount.scala:24) finished in 0.168 s
16/09/16 01:18:18 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 168 ms on localhost (1/1)
16/09/16 01:18:18 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/09/16 01:18:18 INFO DAGScheduler: Job 0 finished: saveAsTextFile at Wordcount.scala:24, took 0.431022 s
16/09/16 01:18:18 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
16/09/16 01:18:18 INFO DAGScheduler: Stopping DAGScheduler
16/09/16 01:18:18 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/16 01:18:18 INFO MemoryStore: MemoryStore cleared
16/09/16 01:18:18 INFO BlockManager: BlockManager stopped
16/09/16 01:18:18 INFO BlockManagerMaster: BlockManagerMaster stopped
16/09/16 01:18:18 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/16 01:18:18 INFO SparkContext: Successfully stopped SparkContext
16/09/16 01:18:18 INFO ShutdownHookManager: Shutdown hook called
16/09/16 01:18:18 INFO ShutdownHookManager: Deleting directory C:\Users\Kirankrishna\AppData\Local\Temp\spark-9c99d1f4-03fc-409b-a1b1-e362339ce7f9
16/09/16 01:18:18 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

After completion, you can check the output directory by refreshing the project folder.

You can see that an output directory has been created, and the result of the word count program is present in the part-00000 file inside it, as shown in the screenshot below.

In the above screenshot, you can see the output of the word count program. We have now successfully executed a Spark application in Windows.
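If you would rather inspect the result programmatically instead of opening the part file by hand, a small sketch like this (assuming the same project layout and the "output" directory created above) prints every (word, count) line:

import org.apache.spark.{SparkConf, SparkContext}

object PrintResult {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PrintResult").setMaster("local"))
    sc.textFile("output").collect().foreach(println) // each line is a (word,count) pair
    sc.stop()
  }
}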

We hope this blog helped you run a Spark application in Windows. Keep visiting our site www.acadgild.com for more updates on big data and other technologies.
