R has a simple, easy-to-use syntax and supports a huge library of packages, which makes it a top data science language. The biggest limitation of R, however, is the amount of data it can process. Apache Spark has fast parallel computing capabilities that can scale across hundreds of nodes, and it, too, is easy to use.
Together, R and Spark provide a distributed DataFrame implementation that supports operations such as selection, filtering, and aggregation.
Readers who want to install R in their systems can follow our blog: Beginner’s Guide to R.
In this blog post, we will learn how to integrate R with Spark. We will also load tables into Spark and perform filtering operations.
Integrating R with Spark
Installing the Required Packages and Software
To check whether the package has been installed correctly, we will load the sparklyr package with the following command:
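A minimal sketch of the installation and load step (assuming sparklyr is installed from CRAN):

```r
# Install sparklyr from CRAN (one-time step)
install.packages("sparklyr")

# Load the package; an error here means the installation failed
library(sparklyr)
```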
Installing Spark via RStudio
To install Spark, run the following command:
spark_install(version = "1.6.2")
It may take some time to download the file and install it.
Once done, the “Installation Complete” message will be shown. Refer to the screenshot below:
To upgrade to the latest version of Sparklyr, run the following command and restart your R session:
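One common way to do this is to reinstall the package from CRAN; the devtools route shown in the comment pulls the latest development version from GitHub instead (both are assumptions here, not prescribed by this post):

```r
# Upgrade to the latest CRAN release of sparklyr
install.packages("sparklyr")

# Alternatively, install the development version (requires devtools):
# devtools::install_github("rstudio/sparklyr")
```

After either command, restart your R session so the new version is loaded.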
Note: It may take some time to download and install the dependency files. Do not close RStudio. In my case, Rtools was also installed.
Restart the R console, and remember to save your work first.
Connecting to Spark
Spark can run both as a local instance and as a remote engine. Here, we will connect to the local instance via the spark_connect function.
The Spark connection that is returned also serves as a remote dplyr data source for the Spark cluster.
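A minimal sketch of connecting from code, assuming the local Spark installation from the previous step:

```r
library(sparklyr)

# Connect to the local Spark instance installed earlier
sc <- spark_connect(master = "local")

# ... work with the cluster through sc ...

# Disconnect when finished
spark_disconnect(sc)
```

The same connection can also be created through the RStudio UI, as described below.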
After sparklyr is installed, you may find a new Spark tab in RStudio, with a New Connection sub-tab inside it. Use it to establish a new connection.
Refer to the following image and keep the settings as displayed below.
Now click on Connect.
This will run three commands in the R console, and a connection will be established. Wait a while for the connection to be established, and do not close your R console.
After the connection succeeds, you will find that the New Connection tab no longer pops up. Refer to the screenshot below:
When connecting, you might get a "Path Not Found" error. This shows up because winutils.exe is not in place. This file helps Hadoop run on the Windows platform.
You may find it present in an undesired location, i.e. a temp directory.
Simply copy the file from the temp directory into the bin directory. (See the path described below.)
Restart RStudio, and establish the connection again.
This may also cause problems on some systems. To work around it, set the Java path from inside the R console:
java_path <- normalizePath('C:\\Program Files\\Java\\jre1.8.0_121')
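For the path to take effect, it needs to be exported as JAVA_HOME; a sketch (the jre1.8.0_121 folder is taken from the line above, so substitute your own Java installation path):

```r
# Normalize the path to the local Java installation
java_path <- normalizePath('C:\\Program Files\\Java\\jre1.8.0_121')

# Point JAVA_HOME at it so Spark can locate Java
Sys.setenv(JAVA_HOME = java_path)
```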
An Easy Example
Now that we have a connection established, we can perform various operations on the tables within the cluster. Let us copy some datasets from R into the Spark cluster.
Note: You may need to install the nycflights13 and Lahman packages in order to execute this example.
To start with, here is a simple filtering example. First, copy the datasets into the cluster:
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
We can now view the tables loaded in Spark. The SparkUI button will also show you the datasets present in Spark.
Filter by departure delay and print the first few records:
flights_tbl %>% filter(dep_delay == 2)
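Aggregation works the same way. Here is a sketch that computes an average delay per carrier inside Spark (the carrier and dep_delay columns come from the nycflights13 flights dataset; the grouping itself is an illustrative choice, not part of the original example):

```r
library(dplyr)

# Average departure delay by carrier, computed inside Spark
flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()  # bring the small summary result back into R
```

Keeping the computation in Spark and calling collect() only on the final summary is what lets R work with data far larger than local memory.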
This is how we integrate R with Spark.