Avro and Parquet are two file formats widely used within the Hadoop ecosystem. Avro is a data serialization and deserialization framework, while Parquet is a columnar storage format that stores records in an optimized way. In this tutorial, we will show you how to load Avro and Parquet data into Spark, and how to write data back out as Avro and Parquet files.
Avro and Parquet in Spark
Let’s see how to work with Avro and Parquet files in Spark. We will start our discussion with Avro.
What is Avro?
Avro is an RPC (Remote Procedure Call) and data serialization framework developed within the Hadoop project. It is also a much-preferred choice for serializing data in Big Data frameworks.
Avro became one of the most widely used data serialization frameworks because of its language neutrality. Due to the lack of language portability in the Hadoop Writable classes, Avro became a natural choice. It can handle multiple data formats, and the serialized data can be further processed by multiple languages.
It uses JSON (JavaScript Object Notation) for defining data types and protocols, and serializes the data in a compact binary format. Its primary use is in Hadoop and Spark, where it provides both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
Avro relies entirely on a schema. When Avro data is read, the schema that was used for writing it is always present. This permits each datum to be written with no per-value overhead, making the serialization both compact and fast. It also facilitates use with dynamic scripting languages: since the data travels together with its schema, it is fully self-describing.
When Avro data is stored in a file, its schema is stored with it, so that the file may be processed later by any program. If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are transmitted.) Since both client and server have each other’s full schema, correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.
Avro schemas are defined in JSON. This facilitates implementation in languages that already have JSON libraries. Using Avro, we can convert unstructured and semi-structured data into properly structured data by applying a schema.
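For illustration, a minimal Avro schema for a record might look like the following. The record and field names here are purely hypothetical, not taken from the dataset used below:

```json
{
  "type": "record",
  "name": "Athlete",
  "fields": [
    {"name": "name",    "type": "string"},
    {"name": "country", "type": "string"},
    {"name": "medals",  "type": ["null", "int"], "default": null}
  ]
}
```

The union type ["null", "int"] makes the medals field optional, with null as its default value.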
Now we will see how to load Avro data into Spark. We already have an Avro file that was built using Hive; you can refer to our blog on working with Avro in Hive to know the procedure.
De-serialization with Avro in Spark
Converting an Avro file back into a normal, readable form is called deserialization. As Avro relies on a schema, Avro data can be treated as structured data, so we will use Spark SQL to work with these files.
For loading Avro files, you need to download the Databricks spark-avro JAR file; you can download the JAR file from here.
After downloading the JAR file, you will need to add it to your classpath. To import the JAR file into spark-shell, use :cp <jar_file_name>
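Alternatively, if your machine has internet access, spark-shell can fetch the library straight from Maven Central with the --packages flag, so no manual download or classpath step is needed. The coordinates below are only an example for Spark 1.x with Scala 2.10; pick the artifact version that matches your Spark and Scala versions:

```shell
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
```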
Now after successful import, you can load the Avro data using the sqlContext as follows:
val df = sqlContext.read.format("com.databricks.spark.avro").load("file:///home/kiran/Documents/000000_0.avro")
You can see the same in the below screenshot.
You can see in the above screenshot that the Avro data has been successfully loaded as a DataFrame. Now you can perform all the DataFrame operations on this data.
You can check the contents of the DataFrame using df.show().
In the above screenshot, you can see the contents of the DataFrame; these are the contents of an Olympic dataset.
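Once the Avro data is loaded as a DataFrame, the usual DataFrame operations apply. A small sketch follows; the column names athlete and country are only assumptions about the Olympic dataset, so substitute the columns your file actually has:

```scala
// Inspect the schema Spark derived from the Avro file's own schema
df.printSchema()

// Project and filter like any other DataFrame
df.select("athlete", "country")
  .filter(df("country") === "IND")
  .show()
```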
Data Serialization with Avro in Spark
Converting normal data into an Avro file is called serialization. We will now see how to save a DataFrame as an Avro file. If you have already created a DataFrame, you can easily save it as an Avro file.
df.write.format("com.databricks.spark.avro").save("/home/kiran/Desktop/df_to_avro")
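The spark-avro library also lets you choose the compression codec for the output through a SQLContext setting. A sketch, assuming a spark-avro build that supports this configuration key (the output path here is just an example):

```scala
// Ask spark-avro to compress output blocks with Snappy;
// supported codec values include "uncompressed", "snappy" and "deflate"
sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")

df.write.format("com.databricks.spark.avro").save("/home/kiran/Desktop/df_to_avro_snappy")
```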
Now in the specified path, you can see files with the .avro extension. The same is shown in the below screenshot.
You can see the files with the .avro extension in the above screenshot, and the serialized content inside those files.
Working on Parquet files in Spark
Parquet is an open-source file format for Hadoop, Spark, and other Big Data frameworks. Parquet stores nested data structures in a flat columnar format. Compared to a traditional approach where data is stored in a row-oriented format, Parquet is more efficient in terms of both performance and storage.
Parquet can be used by any framework in the Hadoop ecosystem, such as Spark, Hive, Impala, and Pig.
Parquet stores binary data in a column-oriented way: the values of each column are organized so that they are adjacent on disk, enabling better compression. It is especially good for queries that read particular columns from a “wide” table (one with many columns), since only the needed columns are read and the I/O (Input/Output) is minimized.
When data is processed with Big Data frameworks such as Hadoop or Spark, the cost of storing it in HDFS is significant: because of the HDFS replication factor, a minimum of 3 copies of each file are maintained for fault tolerance. Storage cost rises accordingly, and processing cost rises along with it as the data moves through the CPU, network I/O, and so on. To minimize these costs, Parquet is a good choice for developers, since it stores data efficiently and thereby improves performance.
To work on Parquet files, we do not need to download any external jar files. Spark by default has provided support for Parquet files.
We will now convert the above DataFrame into a Parquet file. It is very simple in Spark: just save the DataFrame as a Parquet file.
val data = df.saveAsParquetFile("/home/kiran/Desktop/df_to_paraquet")
Now in the specified path, you can see files with the .parquet extension. The same is shown in the below screenshot.
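Note that from Spark 1.4 onwards, saveAsParquetFile is deprecated in favor of the DataFrameWriter API. If you are on a newer build, the equivalent write would look like this (the output path is just an example):

```scala
// DataFrameWriter equivalent of saveAsParquetFile (Spark 1.4+)
df.write.parquet("/home/kiran/Desktop/df_to_parquet_new")
```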
In the parquet files, you can see the binary content.
Now we will see how to load Parquet data into Spark. It is quite simple: just load the data as a Parquet file.
val df = sqlContext.parquetFile("/home/kiran/Desktop/df_to_paraquet")
Now the Parquet data has been successfully converted into a DataFrame. You can perform all the DataFrame operations on this data. The same is shown in the below screenshot.
In the above screen shot, you can see the contents of the Parquet file.
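Because only the columns referenced in a query are read from disk, Parquet-backed DataFrames pair nicely with Spark SQL. A sketch, again assuming hypothetical column names from the Olympic dataset:

```scala
// Register the DataFrame as a temporary table and query it with SQL;
// Parquet reads only the column chunks the query actually touches
df.registerTempTable("olympic")
sqlContext.sql("SELECT country, COUNT(*) FROM olympic GROUP BY country").show()
```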
We hope this blog helps you understand how to work with Avro and Parquet files in Spark. In our next blog, we will discuss other file formats in Spark. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.