In this post, we will discuss the new features of Spark 2.0.0 and its installation on Hadoop 2.7.
We highly recommend that our readers go through our earlier posts on Spark to get a clear idea of what Spark is and the reasons behind its popularity.
What’s New in Spark 2.X?
- Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset are now unified, i.e. DataFrame is just a type alias for Dataset[Row]. In Python and R, given the lack of compile-time type safety, DataFrame remains the main programming interface.
- SparkSession: New entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.
- A new streamlined configuration API for SparkSession.
- Simpler, more performant accumulator API.
- A new, improved Aggregator API for typed aggregation in Datasets.
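The new entry point and its configuration API can be sketched as below. This is a minimal illustration, assuming a local-mode application; the app name and config value are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Single entry point replacing the old SQLContext and HiveContext
val spark = SparkSession.builder()
  .appName("Spark2Demo")   // hypothetical application name
  .master("local[*]")      // assumption: local mode, for illustration only
  .getOrCreate()

// The streamlined runtime configuration API
spark.conf.set("spark.sql.shuffle.partitions", "8")

// SQLContext functionality is available directly on the session
val df = spark.range(0, 10).toDF("id")
df.show()
```

Inside the spark-shell of Spark 2.x, a session named `spark` is already created for you, so the builder step can be skipped there.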
We have come across a new term here: 'Dataset'. What is a Dataset?
Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an untyped API. Conceptually, you can consider DataFrame as an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class defined in Scala or a class in Java.
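The distinction can be sketched in Scala as follows; the `Person` case class and `people.json` input file are hypothetical, and a running SparkSession named `spark` is assumed.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._   // encoders for case classes (assumes a SparkSession named `spark`)

case class Person(name: String, age: Long)

// Untyped API: a DataFrame is just Dataset[Row]
val df: DataFrame = spark.read.json("people.json")   // hypothetical input file

// Strongly-typed API: each record is a Person, checked at compile time
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 21).show()
```

Note that a typo such as `_.agee` in the typed version fails at compile time, whereas the equivalent untyped expression would only fail at runtime.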
New Features Added to Spark SQL:
Spark 2.0 has substantially improved SQL functionality, with SQL2003 support. Spark SQL can now run all 99 TPC-DS queries. Most prominently, the following have been improved:
- A native SQL parser that supports both ANSI-SQL as well as Hive QL
- Native DDL command implementations
- Subquery support, including:
  - uncorrelated scalar subqueries
  - correlated scalar subqueries
  - NOT IN predicate subqueries (in WHERE/HAVING clauses)
  - IN predicate subqueries (in WHERE/HAVING clauses)
  - (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
- View canonicalization support
In addition, when built without Hive support, Spark SQL should have almost all of the functionality it has when built with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.
- Native CSV data source, based on Databricks’ Spark-CSV module
- Off-heap memory management for both caching and runtime execution
- Hive style bucketing support
- Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
- Substantial (2 – 10X) performance speedups for common operators in SQL and DataFrames, via a new technique called whole stage code generation.
- Improved Parquet scan throughput through vectorization.
- Improved ORC performance.
- Many improvements in the Catalyst query optimizer for common workloads.
- Improved window function performance via native implementations for all window functions.
- Automatic file coalescing for native data sources.
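As a sketch of the new subquery support and of whole-stage code generation, assuming a SparkSession named `spark` and a hypothetical `employees` temporary view with `name`, `salary`, and `dept` columns:

```scala
// Correlated scalar subquery: employees earning above their department's average
val result = spark.sql("""
  SELECT name, salary
  FROM employees e
  WHERE salary > (SELECT AVG(salary) FROM employees WHERE dept = e.dept)
""")
result.show()

// Whole-stage code generation: operators fused into a single generated
// function are marked with '*' in the physical plan
result.explain()
```

The `explain()` output is where the 2-10x speedups mentioned above become visible: chains of starred operators are compiled into one tight loop instead of being interpreted one operator at a time.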
New Features Added to Spark MLlib:
ML Persistence: The DataFrames-based API provides near-complete support for saving and loading ML models and Pipelines in Scala, Java, Python, and R.
- MLlib in R: SparkR now offers MLlib APIs for generalized linear models, naive Bayes, k-means clustering, and survival regression.
- Python: PySpark now offers many more MLlib algorithms, including LDA, Gaussian Mixture Model, Generalized Linear Regression, and more.
- Algorithms added to DataFrames-based API: Bisecting K-Means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer.
New Features Added to SparkR:
The largest improvement to SparkR in Spark 2.0 is support for user-defined functions. There are three user-defined functions: dapply, gapply, and lapply. The first two are for partition-based UDFs, e.g. partitioned model learning, while the latter is for parallel computation such as hyper-parameter tuning.
In addition, there are a number of new features:
- Improved algorithm coverage for Machine Learning in R, including naive Bayes, k-means clustering, and survival regression.
- Generalized linear models support more families and link functions.
- Save and load for all ML models.
- More DataFrame functionality: window functions API; reader/writer support for JDBC and CSV; SparkSession.
New Features Added to Spark Streaming:
Spark 2.0 ships the initial experimental release of Structured Streaming, a high-level streaming API built on top of Spark SQL and the Catalyst optimizer. Structured Streaming enables users to program against streaming sources and sinks using the same DataFrame/Dataset API as for static data sources, leveraging the Catalyst optimizer to automatically incrementalize the query plans.
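A minimal Structured Streaming sketch along the lines of the classic word count; the socket host and port are placeholders, and a Spark 2.0 runtime is assumed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// Treat a socket as an unbounded table of text lines
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // placeholder host
  .option("port", 9999)          // placeholder port
  .load()

// The same DataFrame/Dataset operations as on static data
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Catalyst incrementalizes the plan; updated counts stream to the console
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

The point of the API is visible here: `flatMap`, `groupBy`, and `count` are exactly what you would write for a static Dataset; only the `readStream`/`writeStream` endpoints differ.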
For the DStream API, the most prominent update is the new experimental support for Kafka 0.10.
Dependency, Packaging, and Operations in Spark:
There are a variety of changes to Spark’s operations and packaging process. They are:
- Spark 2.0 no longer requires a fat assembly jar for production deployment.
- The Akka dependency has been removed, and as a result, user applications can now program against any version of Akka.
- Spark now supports launching multiple Mesos executors in coarse-grained Mesos mode.
- Kryo version is bumped to 3.0.
Removals from Spark 1.x:
- Support for Hadoop 2.1 and earlier versions.
- The ability to configure closure serializer.
- TTL-based metadata cleaning.
- Semi-private class org.apache.spark.Logging. We suggest you use slf4j directly.
- Block-oriented integration with Tachyon (subsumed by file system integration).
- Methods deprecated in Spark 1.x.
- Methods on the Python DataFrame that returned RDDs (map, flatMap, mapPartitions, etc.). They are still available on the dataframe.rdd field, e.g. dataframe.rdd.map.
- Less frequently used streaming connectors, including Twitter, Akka, MQTT, ZeroMQ.
- Hash-based shuffle manager.
- History serving functionality from standalone Master.
- For Java and Scala, DataFrame no longer exists as a class. As a result, data sources would need to be updated.
The following changes might require updates to existing applications that depend on the old behaviour or APIs:
- The default build is now using Scala 2.11 rather than Scala 2.10.
- In SQL, floating-point literals are now parsed as the decimal data type rather than the double data type.
- Java RDD’s flatMap and mapPartitions functions used to require functions returning a Java Iterable. They have been updated to require functions returning a Java Iterator, so the functions do not need to materialize all the data.
- Java RDD’s countByKey and countApproxDistinctByKey now return a map from K to java.lang.Long, rather than to java.lang.Object.
- When writing Parquet files, summary files are no longer written by default. To re-enable them, users must set “parquet.enable.summary-metadata” to true.
- The DataFrame-based API (spark.ml) now depends on local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib migration guide for a full list of API changes.
With the inclusion of the whole-stage code generation technique, Spark has become 2-10x faster than its older versions for common operators.
Spark 2.x Installation:
Step 1: To install Spark, we first need to install Scala, as Spark is built on Scala.
To download Scala, type the below command.
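For example, Scala 2.11.1 (the version extracted in the next step) can be fetched with wget; the archive URL is an assumption and may change, so adjust it for your Scala version:

```shell
# Download the Scala 2.11.1 binary archive (URL assumed; verify on scala-lang.org)
wget http://www.scala-lang.org/files/archive/scala-2.11.1.tgz
```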
Step 2: Extract the downloaded tar file using the below command.
tar -xvf scala-2.11.1.tgz
After extracting, specify the path of Scala in your .bashrc file.
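Assuming Scala was extracted into your home directory, the lines added to .bashrc would look like the following (the path is a placeholder; adjust it to your extraction location):

```shell
# In ~/.bashrc -- adjust SCALA_HOME to wherever you extracted Scala
export SCALA_HOME=$HOME/scala-2.11.1
export PATH=$PATH:$SCALA_HOME/bin
```

After saving, run `source ~/.bashrc` so the current shell picks up the change, and verify with `scala -version`.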
Step 3: Download Spark 2.0.0 from the Apache Spark downloads page, selecting your Hadoop version (here, the package pre-built for Hadoop 2.7).
After downloading, untar it by using the following command.
tar -xvzf spark-2.0.0-bin-hadoop2.7.tgz
Step 4: After extracting, export the path in your .bashrc file. Open the file and add the below lines.
export SPARK_HOME=/path/to/spark-2.0.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
After adding the lines, save and close the file, and run the command source .bashrc to apply the changes.
To store the worker's data, you need to create a directory and provide its path in the SPARK_WORKER_DIR export. Here, we have created a directory called sparkdata in the $HOME/work directory.
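The worker directory described above can be created with the following command (the path matches the SPARK_WORKER_DIR value used later in spark-env.sh):

```shell
# Create the Spark worker's scratch directory
mkdir -p $HOME/work/sparkdata
```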
Spark 2.X Configuration:
In the Spark 2.x folder, you can see the conf folder. Within this folder, there is a file called spark-env.sh.template. Make a copy of that file with the name spark-env.sh using the below command.
cp spark-env.sh.template spark-env.sh
export JAVA_HOME=/home/kiran/jdk1.8.0_65 (here you need to provide the path of your JAVA_HOME)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop (here you need to provide your Hadoop configuration directory)
export SCALA_HOME=$HOME/scala-2.11.1 (here you need to provide the path of your SCALA_HOME)
export SPARK_WORKER_DIR=$HOME/work/sparkdata (here you need to give the path of the directory you created for the Spark worker)
After configuring everything, save and close the file, then move into the sbin folder of Spark 2.x and start the Spark daemons using the command ./start-all.sh.
Now, you can see that Spark's Master and Worker daemons are running. Next, you can start the Spark shell by moving into Spark 2.x's bin directory and using the command ./spark-shell.
You can see that the Spark Scala shell has been initialized successfully, and you will be able to browse the web UI as well. While starting, the Spark shell prints the IP address and port at which the web UI is available; here, it shows that the web UI is available at http://192.168.0.124:4040. If we open this address in a browser, we will be able to check the Spark jobs in the web UI.
We hope this post has been helpful in understanding the new features of Spark 2.x and its installation. In case of any queries, feel free to comment below and we will get back to you at the earliest.
Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.