In this tutorial, you will learn about the various file formats in Spark and how to work with them. Before getting into the file formats, let us briefly see what Spark is.
What is Apache Spark?
Apache Spark is a cluster computing framework that can run on top of Hadoop and handle many different types of data. It is a one-stop solution to many problems, as Spark has rich resources for handling data and, most importantly, it can be dramatically faster than Hadoop's MapReduce, especially for iterative workloads. It attains this speed of computation through its in-memory primitives: data is cached in memory (RAM), and computations are performed on it there.
Spark's rich resources cover almost all the components of Hadoop. For example, we can perform both batch processing and near-real-time data processing in Spark without additional tools such as Kafka or Flume from the Hadoop ecosystem, because it has its own streaming engine called Spark Streaming.
We can perform various functions with Spark:
1. SQL operations: Spark has its own SQL engine, called Spark SQL, which covers the features of both SQL and Hive.
2. Machine learning: Spark ships with a machine learning library, MLlib, so it can perform machine learning without the help of Mahout.
3. Graph processing: Spark performs graph processing using its GraphX component.
All the above features are in-built in Spark.
Spark can run on different cluster managers, such as Hadoop YARN and Apache Mesos, and it also ships with its own standalone scheduler so you can get started when no other framework is available. Spark also provides easy access to stored data: it can read from many storage systems, for example HDFS, HBase, MongoDB, and Cassandra, and it can store data on the local file system.
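As a minimal sketch of how the cluster manager is chosen, the master URL passed to the SparkContext decides where Spark runs (the application name and URLs below are hypothetical, for illustration only):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The master URL decides where Spark runs:
//   "local[*]"          -> all cores of the current machine (good for learning)
//   "spark://host:7077" -> Spark's own standalone cluster manager
//   "yarn"              -> Hadoop YARN
//   "mesos://host:5050" -> Apache Mesos
val conf = new SparkConf()
  .setAppName("FileFormatsDemo") // hypothetical application name
  .setMaster("local[*]")
val sc = new SparkContext(conf)
```

In the spark-shell used in the rest of this tutorial, this context is created for you and is available as `sc`.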
For more details on Spark, you can also refer to our beginner’s guide from the below link:
Spark supports all the file formats supported by Hadoop. There are many benefits to using an appropriate file format:
1. Faster access when reading and writing
2. Better compression support
3. Schema orientation
Now we will look at the file formats supported by Spark: Hadoop InputFormats, text files, and sequence files.
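For completeness, any Hadoop InputFormat can be read through the SparkContext directly. A sketch using the new Hadoop API with TextInputFormat (the input path is hypothetical):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read a file through a Hadoop InputFormat; any other InputFormat
// can be plugged in the same way. The path here is hypothetical.
val lines = sc.newAPIHadoopFile(
    "hdfs:///path/to/input",
    classOf[TextInputFormat],
    classOf[LongWritable],   // key: byte offset of the line in the file
    classOf[Text])           // value: the line itself
  .map { case (_, text) => text.toString }
```

`sc.textFile`, used below, is essentially a convenience wrapper over this pattern that keeps only the values.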
You can refer to our blog on HadoopInput and OutputFormats in Spark
Now let us see how to load text files into Spark.
Loading textFiles into an RDD in Spark
Spark provides a textFile function on its context (SparkContext) to load a text file. It is used as follows:
scala> val data = sc.textFile("file:///home/kiran/Desktop/olympix_data.csv")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at textFile at <console>:25

scala> data.take(5)
res6: Array[String] = Array(Michael Phelps,23,United States,2008,8/24/2008,Swimming,8,0,0,8, Michael Phelps,19,United States,2004,8/29/2004,Swimming,6,0,2,8, Michael Phelps,27,United States,2012,8/12/2012,Swimming,4,2,0,6, Natalie Coughlin,25,United States,2008,8/24/2008,Swimming,1,2,3,6, Aleksey Nemov,24,Russia,2000,10/1/2000,Gymnastics,2,1,3,6)
We have a dataset called olympix_data.csv, which contains Olympics data, and we have loaded it successfully into the variable data. We then used the take function to pull out the first 5 rows, and the output is shown above.
Saving an RDD as a Sequence File in Spark
Now we will see how to save an RDD as a sequence file in Spark.
A sequence file is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format. It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFiles.
So, to save an RDD as a sequence file in Spark, we need to create a pair RDD containing keys and values.
It can be done as follows:
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD

scala> val pairs:RDD[(String,String)] = data.map(x => x.split(",")).map(x => (x(1), x(2)))
pairs: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD at map at <console>:27

scala> pairs.saveAsSequenceFile("/home/kiran/Desktop/rdd_to_seq")

So we have created a pair RDD by taking the second and third columns (indices 1 and 2, i.e., the athlete's age and country) as key and value, and saved it as a sequence file. In the specified output path, we can see the binary files that were created.
Now we will load the same sequence file back into Spark.
Loading sequenceFiles into an RDD in Spark
Since sequence files store key/value pairs, we need to load them back as key/value pairs as well. So we will load the sequence file into a pair RDD in Spark. It can be done as follows:
scala> pairs.take(5)
res11: Array[(String, String)] = Array((23,United States), (19,United States), (27,United States), (25,United States), (24,Russia))

scala> val data1:RDD[(String,String)] = sc.sequenceFile("/home/kiran/Desktop/rdd_to_seq")
data1: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD at sequenceFile at <console>:25

scala> data1.take(5)
res12: Array[(String, String)] = Array((23,United States), (19,United States), (27,United States), (25,United States), (24,Russia))
Here, pairs is the RDD containing the key/value pairs, which we saved as a sequence file; we then loaded the same sequence file into the RDD data1 and took its first 5 records. The two outputs match.
So we have successfully loaded the sequence file in Spark. Now we will save the same RDD as a text file.
Saving RDD as a TextFile in Spark
We can save an RDD as a text file using its saveAsTextFile function.
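The save itself is a one-liner. Assuming the `pairs` RDD built earlier and a hypothetical output directory:

```scala
// Spark writes one part-* file per partition into the output directory;
// each (key, value) pair is written as its tuple string, e.g. (23,United States).
// The output path here is hypothetical.
pairs.saveAsTextFile("/home/kiran/Desktop/rdd_to_text")
```

Note that the path names a directory, not a single file, and Spark will fail if it already exists.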
Once it is saved successfully, you can check for the data in the specified location.
With that, we have seen how to load and store data using text and sequence files in Spark.
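Putting all the steps above together, the whole round trip can be sketched as one small standalone application (the paths, application name, and object name are hypothetical; in spark-shell you would run the body with the shell's own `sc` instead):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// A self-contained sketch of the round trip covered in this tutorial:
// text file -> pair RDD -> sequence file -> pair RDD -> text file.
object FileFormatsRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("FileFormatsRoundTrip").setMaster("local[*]"))

    // 1. Load the text file into an RDD of lines.
    val data = sc.textFile("file:///home/kiran/Desktop/olympix_data.csv")

    // 2. Build a pair RDD (age, country) and save it as a sequence file.
    val pairs: RDD[(String, String)] =
      data.map(_.split(",")).map(x => (x(1), x(2)))
    pairs.saveAsSequenceFile("/home/kiran/Desktop/rdd_to_seq")

    // 3. Load the sequence file back and save it as plain text.
    val back: RDD[(String, String)] =
      sc.sequenceFile("/home/kiran/Desktop/rdd_to_seq")
    back.saveAsTextFile("/home/kiran/Desktop/rdd_to_text")

    sc.stop()
  }
}
```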
You can refer to the blog below for working on Avro and Parquet file formats in Spark.
We hope this blog helped you understand how to work with text files, sequence files, and Hadoop InputFormats in Spark. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.