The post Frequently Asked Hive Technical Interview Queries appeared first on AcadGild.
Before reading this blog, you can refer to the posts linked below to learn more about Hive.
https://acadgild.com/blog/hive-beginners-guide
https://acadgild.com/blog/bucketing-in-hive
https://acadgild.com/blog/hive-real-life-use-cases
Then, let us begin with queries.
Scenario 1: Write a query to find the friends of friends of a user.
DATASET:
Here we are using the dataset friend_details to achieve the above objective; it contains 2 columns, user_name and user_friend_name.
DOWNLOAD LINK:
You can download the friend_details dataset from the below link.
Once the input dataset is downloaded, use the below commands to create the table and load the dataset into it.
CREATE TABLE:
CREATE TABLE FRIEND_DETAILS(USER_NAME STRING, USER_FRIEND_NAME STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA:
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Blog/Hive_Queries_Interview_Questions/Datasets/friend_list.txt' INTO TABLE FRIEND_DETAILS;
You can view the table values using the select all command.
VIEW DATA:
SELECT * FROM FRIEND_DETAILS;
Now, let us write the query to find the friends of friends of a user.
QUERY:
SELECT f1.USER_NAME, f2.USER_FRIEND_NAME as user_friend_of_friend FROM FRIEND_DETAILS f1, FRIEND_DETAILS f2 WHERE f1.USER_NAME = 'prateek' AND f1.USER_FRIEND_NAME = f2.USER_NAME;
OUTPUT:
In the above query, we create aliases f1 and f2 for the table and apply a WHERE condition to get the friends of friends of the user prateek.
As we can see from the result, Onkar is a friend of Prateek, and ajit and sumit are friends of Onkar; thus the friends of friends of Prateek are ajit and sumit.
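The same self-join logic can be sketched in Python with pandas. This is a hedged illustration on a made-up friend list mirroring the blog's example, not the original dataset:

```python
import pandas as pd

# Hypothetical friend list echoing the blog's example rows
friends = pd.DataFrame({
    "user_name":        ["prateek", "onkar", "onkar"],
    "user_friend_name": ["onkar",   "ajit",  "sumit"],
})

# Self-join: match each friend of 'prateek' back to that friend's own rows
f1 = friends[friends["user_name"] == "prateek"]
fof = f1.merge(friends, left_on="user_friend_name", right_on="user_name",
               suffixes=("", "_f2"))
result = fof[["user_name", "user_friend_name_f2"]].rename(
    columns={"user_friend_name_f2": "user_friend_of_friend"})
print(result)
```

The merge plays the role of the `f1`/`f2` table aliases in the HiveQL query above.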
Scenario 2: In the SALES table quantity of each product is stored in rows for every year. Write a query to transpose the quantity for each product and display it in columns?
DATASET:
Here we are using dataset prod and sales to achieve the above objective.
DOWNLOAD LINK:
You can download the product details and product sales datasets from the below link.
DATASET DESCRIPTION:
PROD:
SALES:
Once the input datasets are downloaded, use the below commands to create the tables and load the datasets into them.
CREATE TABLE PROD:
CREATE TABLE PROD(PROD_ID INT,PROD_NAME STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA:
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Blog/Hive_Queries_Interview_Questions/Datasets/product_details.txt' INTO TABLE PROD;
VIEW DATA:
SELECT * FROM PROD;
CREATE TABLE SALES:
CREATE TABLE SALES(SALE_ID INT, PRODUCT_ID INT, YEAR INT, QUANTITY INT, PRICE INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA:
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/Blog/Hive_Queries_Interview_Questions/Datasets/product_sales.txt' INTO TABLE SALES;
VIEW DATA:
SELECT * FROM SALES;
Now, let us write the query to transpose the quantity for each product and display it in columns.
Here the result should contain the product name and the total product quantity sold in each year.
EXPECTED RESULT:
prod_name quantity_sold_in_2016 quantity_sold_in_2017 quantity_sold_in_2018
QUERY:
SELECT P.PROD_NAME, MAX(IF(S.YEAR=2016, S.QUANTITY, NULL)) QUAN_2016, MAX(IF(S.YEAR=2017, S.QUANTITY, NULL)) QUAN_2017, MAX(IF(S.YEAR=2018, S.QUANTITY, NULL)) QUAN_2018 FROM PROD P, SALES S WHERE (P.PROD_ID = S.PRODUCT_ID) GROUP BY P.PROD_NAME;
In the above query, we have used the IF condition to create the year columns and performed a GROUP BY operation on the product name.
OUTPUT:
We can see from the output that we have successfully displayed the total quantity of each product sold per year.
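The same transpose can be sketched in pandas. The data below is illustrative, not the original files:

```python
import pandas as pd

# Hypothetical sales rows: one quantity per product per year
sales = pd.DataFrame({
    "prod_name": ["A", "A", "A", "B", "B", "B"],
    "year":      [2016, 2017, 2018, 2016, 2017, 2018],
    "quantity":  [10, 20, 30, 5, 15, 25],
})

# Pivot years into columns, mirroring the MAX(IF(...)) trick in the HiveQL
wide = sales.pivot_table(index="prod_name", columns="year",
                         values="quantity", aggfunc="max")
print(wide)
```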
Scenario 3: Write a query to find the number of products sold in each year?
For the above query, we will refer to the SALES table used in the previous scenario.
QUERY:
SELECT YEAR, COUNT(1) NUM_PRODUCTS FROM SALES GROUP BY YEAR;
OUTPUT:
As we can observe, we have successfully achieved our objective of finding the number of products sold in each year.
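In pandas the same per-year count is a one-line group-by (again on hypothetical rows, not the real SALES table):

```python
import pandas as pd

sales = pd.DataFrame({"sale_id": [1, 2, 3, 4],
                      "year":    [2016, 2016, 2017, 2018]})

# COUNT(1) ... GROUP BY YEAR is equivalent to the group sizes per year
num_products = sales.groupby("year").size()
print(num_products)
```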
We hope this post has been helpful in understanding how to perform data analysis using Hive. In the case of any queries, feel free to comment below and we will get back to you at the earliest.
Keep visiting our site AcadGild for more trending blog updates on Big Data and other technologies.
The post Apple WWDC 2019 appeared first on AcadGild.
In this blog, we will look at the upcoming features of Apple products like the iPhone, Mac, iPad, Apple Watch, and other Apple products, which were announced at the developers' conference.
So, what is WWDC?
WWDC is the acronym for the Apple Worldwide Developers Conference, where Apple showcases its new features and technologies for software developers. Attendees can participate in hands-on labs with Apple engineers, and sessions cover a wide variety of topics.
This year's WWDC was held in San Jose, California, on Monday. Let us see the remarkable updates that were announced:
Now let us see each new version in brief.
iOS 13
Apple completely reinvented the Reminders app from the ground up to make it more intelligent, intuitive, and powerful, and to make creating reminders easier.
For instance, just type what you want and the Reminders app will understand when and where to notify you. You can also use the quick type bar by just tapping it and attach things like a location or photos to tasks, which helps you organize and keep track of the most important items. If you tag a person in the Reminders app, the next time you have a message conversation with that person you will receive an automatic notification letting you know that now is a good time to talk.
watchOS 6
macOS Catalina
tvOS 13
iPadOS
The 3rd Generation Mac Pro
Private and Secure
Homekit secure video
This brings us to the end of our blog. We hope it was helpful in covering the new updates to Apple's apps and products.
Keep visiting our website for more new blogs on technological advancements, Data Science, and Big Data.
The post Google Assistant appeared first on AcadGild.
As we know, Google is the most used search engine. Google has conducted a technical conference every year since 2008, where it introduces new technologies or add-ons to existing ones.
Let us study about the Google conference and technology introduced by them in detail.
Google conducts an annual developer conference called Google I/O. The first Google I/O took place in 2008. I/O stands for Input/Output and also refers to the slogan 'Innovation in the Open'.
The conference mainly includes technical sessions where attendees learn about developing applications for all Google’s platforms. The attendees are also provided with hands-on labs, where they can test whatever they have learned and get help from experts.
In one of such conferences in 2016, Google Assistant was launched.
Google Assistant is an artificial-intelligence-powered virtual assistant developed by Google and available on smart devices. It is an advanced version of Google Now that can be used in a two-way conversation.
Users can interact with their natural voice as well as by giving input through the keyboard. The catchphrase for enabling the assistant is ‘Ok Google’. This wakes the assistant up and we can ask for anything we want.
The assistant holds a two-way conversation using natural language processing algorithms.
Let us understand what natural language processing is.
NLP, which stands for Natural Language Processing is a branch of AI(Artificial Intelligence) that helps computers understand, interpret and manipulate human language.
NLP is a way by which computers understand, analyze and derive meaning from human languages such as Hindi, English, Spanish, etc.
The challenges in Natural Language Processing frequently involve speech recognition, natural language understanding, and natural language generation.
This is how Google Assistant understands our language, turns it into speech, and replies to us with text and speech.
Activities that Google Assistant can do:
Google Assistant can help the users with the following things:
and a lot more things.
What’s new with Google Assistant?
Google introduced Google Lens at the Google I/O 2017 conference, and it was integrated into Google Assistant.
Google Lens is an image-recognition technology designed to provide relevant information about the objects it identifies, using visual analysis based on a neural network.
Let us understand what a neural network is:
A neural network is a circuit of neurons; in the modern sense, an artificial neural network is a composition of artificial neurons or nodes.
It is an attempt to simulate the network of neurons that makes up a human brain, so that the computer can learn things and make decisions in a human-like manner.
In the field of Artificial Intelligence, artificial neural networks are applied to speech recognition, image analysis, and adaptive control.
How does Google Lens work?
When directing the phone’s camera at an object, Google Lens will attempt to identify the object, read barcodes and QR codes, labels and text, and shows relevant search and information.
For example, when pointing the device's camera at a Wi-Fi label containing the network name and password, it will automatically connect to the Wi-Fi network that has been scanned.
Google Lens uses more advanced Deep Learning routines.
What else can Google Lens do?
and a lot more.
Now let us understand what Deep Learning is:
Deep Learning is a part of Machine Learning based on artificial neural networks. Machine Learning uses algorithms to parse data, learn from that data, and make informed decisions from it.
Deep Learning structures algorithms in layers to create an artificial neural network that can learn and make important decisions on its own.
Device support is limited and requires Android Marshmallow (6.0) or newer.
Virtual assistants like Cortana for Windows, Siri for iOS, and Google Assistant for Android are all fantastically capable, and they have made rapid progress in the last few years.
But according to user experience, Google Assistant seems to be evolving the fastest and most effectively on Android devices, as it is better at picking up what we are saying and responds to our queries quickly.
Google Assistant works smoothly on most platforms, and it certainly benefits from integrating with some of Google's other services.
We hope this blog has been useful in understanding trending technologies.
You can expect more technical awareness blogs in the future; until then, keep visiting our website AcadGild for more updates on Data Science and other technologies.
The post Data Manipulation using R appeared first on AcadGild.
In this article, we will perform data manipulation operations using the dplyr package on the Houston flights dataset, which is available in R.
Data manipulation is an operation performed on an existing dataset to produce a desired result. These operations can include finding patterns, querying, sorting, filtering, removing fields with empty column values, modifying/updating given data, and more.
To achieve all the above operations, we will use the Houston flights sample dataset.
Here we will be using two packages in our examples:
dplyr is a package used to perform data manipulation operations. It introduces easy-to-use functions for deriving new variables and viewing observations in new ways, which are very handy when performing exploratory data analysis and data manipulation.
The hflights package contains the Houston airports dataset, which is available with R by default. This dataset was created by the Bureau of Transportation Statistics for the Research and Innovative Technology Administration. The dataset has 227,496 rows and 21 columns (variables):
Year, Month, DayofMonth, DayOfWeek; DepTime, ArrTime; UniqueCarrier; FlightNum; TailNum; ActualElapsedTime; AirTime; ArrDelay, DepDelay; Origin, Dest (origin and destination airport codes); Distance; TaxiIn, TaxiOut; Cancelled (cancelled indicator: 1 = Yes, 0 = No); CancellationCode (reason for cancellation: A = carrier, B = weather, C = national air system, D = security); Diverted (diverted indicator: 1 = Yes, 0 = No).
First, we will install dplyr package to perform data manipulation operations and hflights package to use Houston flights dataset.
In the below code we can see how to install packages.
> install.packages("hflights")
> install.packages("dplyr")
And load the libraries of the hflights and dplyr packages as well.
> library(hflights)
> library(dplyr)
We can use the help() function to see the description of the Houston flights data in the help section on the right side of RStudio.
> help(hflights)
By using View(), we can see how the actual dataset looks.
> View(hflights)
Here in the above table, we can see hflights dataset.
Now, by using str(), you can see the structure of hflights.
> str(hflights)
In the console above, we can see that it is a data frame with 227,496 observations of 21 variables, and that the variables are of integer and character types.
Let us see the summary of the dataset by using the summary() function.
> summary(hflights)
Here in the console, it shows statistics such as the minimum, quartiles, mean, median, and the count of NA values for each variable.
If you want to use the select() function to pick specific columns from the data frame, you have to use the pipe operator %>%, which comes with the dplyr package. Here the pipe operator pipes the data frame df into select(), taking df and selecting three columns: AirTime, TaxiIn, and TaxiOut.
df is very large, and we don't want to print it because that may take long. Instead, we should store the result of the query in an updated_df object.
> updated_df <- df %>% select(AirTime, TaxiIn, TaxiOut)
In the environment section, we can see that the new updated_df has the same 227,496 observations, but the number of columns has changed because we selected only 3 columns.
If we want all the variables except the 3 we selected, we can use -c() so that those columns are dropped.
So Let’s check it out by the code above to drop variables.
In the environment section, we can see that the new updated_df has the same 227,496 observations, but the number of columns has changed because we dropped the selected variables, leaving a total of 18 remaining variables.
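For readers coming from Python, the select/drop pattern above has a direct pandas analogue. This is an illustrative sketch on a toy frame, not the blog's hflights code:

```python
import pandas as pd

# Toy stand-in for the hflights data frame
df = pd.DataFrame({"AirTime": [40, 45], "TaxiIn": [4, 6],
                   "TaxiOut": [15, 12], "Distance": [224, 224]})

# Equivalent of df %>% select(AirTime, TaxiIn, TaxiOut)
selected = df[["AirTime", "TaxiIn", "TaxiOut"]]

# Equivalent of select(df, -c(AirTime, TaxiIn, TaxiOut))
dropped = df.drop(columns=["AirTime", "TaxiIn", "TaxiOut"])
print(selected.columns.tolist(), dropped.columns.tolist())
```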
Now let's look at the sample_frac() function, which creates a new data frame with a small sample out of the large dataset when we do not want to use all the data.
In the above code, we take a sample size of 0.2, i.e., 20% of the dataset. So if you want to do some analysis on a subset of the data, you can select a random data sample by using the sample_frac() function.
In the environment section, we can see that the new sampled updated_df has 45,499 observations, which is 20% of the original dataset, and the number of columns remains unchanged because sample_frac only selects a random subset of observations.
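The same fractional sampling can be sketched in pandas; here `random_state` is added only so the illustration is reproducible:

```python
import pandas as pd

# Toy frame; in the blog this would be the 227,496-row hflights data
df = pd.DataFrame({"Distance": range(100)})

# Equivalent of dplyr's sample_frac(df, 0.2): a random 20% of the rows
sampled = df.sample(frac=0.2, random_state=42)
print(len(sampled))
```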
We use the mutate() function to create a new column or new variable in the existing data.
In the below code, we use the pipe operator %>% to pipe df into the mutate() operation and store the result in the updated_df object.
On the right-hand side of RStudio, in the environment section, we can see updated_df with 22 variables; the extra variable was added using the mutate() function.
By clicking on updated_df we can easily view the modified dataset.
The above dataset shows the AvgSpeed added as a new column.
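A pandas sketch of the same idea. The AvgSpeed formula here (distance over air time, with AirTime assumed to be in minutes) is an assumption for illustration; the blog's exact formula is not shown:

```python
import pandas as pd

df = pd.DataFrame({"Distance": [224, 1235], "AirTime": [40, 150]})

# Equivalent of dplyr's mutate(): add a derived column
# (assumed formula: miles per hour from distance and minutes of air time)
df = df.assign(AvgSpeed=df["Distance"] / df["AirTime"] * 60)
print(df["AvgSpeed"].tolist())
```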
Now let us see how to group the data by using the group_by() and summarise() functions.
In the below code, to see Month and avg_delay, we use group_by() to group the data frame. The group_by() function is generally used to group a data frame by one or more columns, for use with mean, sum, or other functions.
The summarise() function then reduces the grouped data. So you can group with group_by() and summarise the result.
On the right-hand side of RStudio, in the environment section, we can see updated_df with 12 observations and 2 variables, created using the group_by() and summarise() functions.
By clicking on updated_df we can easily view the modified dataset.
Here in updated_df, created using the group_by() function, we can see the Month and avg_delay columns.
Likewise, if we want to see only DayOfWeek and avg_delay, we follow the same procedure: use group_by() to group the data frame and reduce it with summarise().
On the right-hand side of RStudio, in the environment section, we can see updated_df, which has 7 observations and 2 variables, created by using the group_by() function.
By clicking on updated_df we can easily view the modified dataset.
Here in updated_df, created using the group_by() function, we can see the DayOfWeek and avg_delay columns.
If you want to find the flights with a high departure delay, we group by flight number, calculating the average departure delay for each flight number.
Here, we use the group_by() function to group the data frame and reduce it with summarise().
On the right-hand side of RStudio, in the environment section, we can see the updated_df, which has 3,740 observations and 2 variables, created by using the group_by() function.
Click on the updated_df dataframe to view the modified dataset. From the below table we can see the flights (flight numbers) that have the minimum average delay and maximum delay.
Here in the updated_df dataframe, created using the group_by() function, we can see the flight number and avg_delay columns. 1817 is the flight number with the minimum average delay of -10 minutes (i.e., on average it departed 10 minutes before its scheduled time), and 4493 is the flight number with the maximum average departure delay of 281 minutes (i.e., it departed that many minutes after its scheduled departure time).
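The grouped average can be sketched in pandas. The toy rows below are contrived to reproduce the -10 and 281 averages mentioned above; they are not the real hflights data:

```python
import pandas as pd

flights = pd.DataFrame({
    "FlightNum": [1817, 1817, 4493, 4493],
    "DepDelay":  [-12, -8, 300, 262],
})

# Equivalent of group_by(FlightNum) %>% summarise(avg_delay = mean(DepDelay))
avg_delay = flights.groupby("FlightNum")["DepDelay"].mean()
print(avg_delay)
```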
Now say you want the average departure delay for each month and for each flight separately; for this objective we have to group by two variables, Month and FlightNum.
We will follow the same steps as we have done to find the FlightNum and the avg_delay in the previous scenario.
We have to use group_by() function to group the data frame and summarise it by reducing the data by using summarise() function.
On the right-hand side of RStudio, in the environment section, we can see the updated_df dataframe, which has 14,872 observations and 3 variables, created by using the group_by() function.
Click on the updated_df dataframe to view the modified dataset.
Now we can see separate columns for departure delay for each month and for each flight in the below table.
From the above data table, we can see the summary statistics for the Month and FlightNum columns, produced using the group_by() function.
For example: find the month that has the maximum departure delay, or find the day or date that has the maximum departure delays. For these kinds of queries you need to use the group_by() function first, then the summarise() function to aggregate the data; then you can find the month or the day that has the maximum departure delay.
summarise() is a function used to perform aggregate operations, such as finding the minimum, maximum, or mean value in a given dataset.
To find the basic statistics of a variable, take the arrival delay (ArrDelay) in the original data frame df. Looking at the arrival delay, the minimum value is -70; this means the flight was not delayed but rather reached its destination 70 minutes ahead of its scheduled time.
Looking at ArrDelay again, the maximum arrival delay is 978 minutes past the scheduled time.
The dataset has some NA values, i.e., missing values. First and foremost, we have to find the missing data in our original df. We need to remove the missing data so that we can compute a summary of the data, because these NA values will hinder our calculations.
This is how our dataset looks like when we scroll down to dataset.
Let's first filter this original data frame. We use df and then the filter() function with !is.na() on the arrival delay.
On the right-hand side of RStudio, in the environment section, we can see updated_df, which has 223,874 observations and 21 variables, created after dropping the NA values.
By clicking on updated_df we can easily view the modified dataset.
Here in the above data set, we can see there is no NA values left because we have filtered those.
Now, after filtering, we are going to find the summary of the arrival delay column using the summarise() function. So we can use the filter() and summarise() functions to find the min, average, and max arrival delay.
Here:
# summarise - aggregate function; summary about ArrDelay
# min(x) - minimum value of vector x
# max(x) - maximum value of vector x
# mean(x) - mean or average value of vector x
So here in the console, we can see that min_delay is -70, avg_delay is 7.094, and max_delay is 978.
We can also modify the data frame using filter() and summarise() function.
On the right-hand side of RStudio, in the environment section, we can see updated_df, which has 1 observation and 3 variables, created by using the filter() function.
By clicking on updated_df dataframe we can view the modified dataset.
So here in the above table, the updated_df dataframe displays min_delay equal to -70, avg_delay equal to 7.094334, and max_delay equal to 978 after removing the NA values.
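In pandas, the drop-NA-then-summarise pipeline looks like this sketch (toy delay values, not the real ArrDelay column):

```python
import numpy as np
import pandas as pd

arr_delay = pd.Series([-70, np.nan, 7, 978, np.nan, 5])

# Equivalent of filter(!is.na(ArrDelay)) %>% summarise(min, mean, max)
clean = arr_delay.dropna()
summary = {"min_delay": clean.min(), "avg_delay": clean.mean(),
           "max_delay": clean.max()}
print(summary)
```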
Now we are going to use the filter() function to filter the rows based on a condition applied to a column.
If you look at the original data frame, after sorting the Distance variable in order, the minimum value is 79 and the maximum value is 3904.
First, take df with the pipe operator %>%, then use the filter() function to keep only those rows where the Distance variable is greater than 2000.
On the right-hand side of RStudio, in the environment section, we can see the updated_df dataframe, which has 918 observations and 21 variables, created using the filter() function where the Distance value is greater than 2000.
Here we want to select those rows where the distance is greater than 2000, so the filter drops all the rows where the distance is 2000 or less.
Click on the updated_df to view the modified dataset.
Here in the above updated_df table, we can see that the Distance column consists only of values greater than 2000.
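The same row filtering in pandas is plain boolean indexing (toy rows for illustration):

```python
import pandas as pd

df = pd.DataFrame({"FlightNum": [1, 2, 3], "Distance": [79, 2100, 3904]})

# Equivalent of df %>% filter(Distance > 2000)
far = df[df["Distance"] > 2000]
print(far["FlightNum"].tolist())
```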
Now let's say you want to filter on DayOfWeek; the minimum DayOfWeek is 1 and the maximum is 7, coded from Monday to Sunday.
We want only Saturday and Sunday, i.e., 6 and 7. To do this, we use the DayOfWeek variable and write it as DayOfWeek %in% c(6,7).
That means DayOfWeek should have only the values 6 and 7 from that vector.
On the right-hand side of RStudio, in the environment section, we can see updated_df, which has 59,687 observations and 21 variables, created by using the filter() function.
So DayOfWeek will be 6 at the minimum and 7 at the maximum.
Here in the above updated_df table, we can see DayOfWeek is 6 and 7 selected by filter() function.
So we can filter based on many conditions: you can use less-than, greater-than, or equality for a numeric or character variable. And if you want the variable to have only a few values, you can use %in% and mention the values you want, like 6 and 7. If you want only 3, just write DayOfWeek == 3 and it will keep only the rows where DayOfWeek is 3.
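The %in% membership test maps to pandas' isin() (again a toy sketch):

```python
import pandas as pd

df = pd.DataFrame({"DayOfWeek": [1, 3, 6, 7, 2, 6]})

# Equivalent of filter(DayOfWeek %in% c(6, 7)): keep only weekend rows
weekend = df[df["DayOfWeek"].isin([6, 7])]
print(weekend["DayOfWeek"].tolist())
```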
So, we have covered the pipe operator %>% and the select(), sample_frac(), mutate(), group_by(), filter(), and summarise() functions for data manipulation using the dplyr package on the Houston flights data with R.
From the above examples, we believe this blog helped you in understanding Data manipulation operations using the dplyr package on Houston flights data with R.
You can also refer to our blog on linear model building using the airquality dataset with R: https://acadgild.com/blog/linear-model-building
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies. Click here to learn about our data science course in Bangalore.
The post Top 10 Python Libraries For Data Science in 2019 appeared first on AcadGild.
In this blog, we will discuss the most popular Python libraries used in data science projects.
Before moving on to the most frequently used Python libraries, we recommend that beginners revise Python by referring to the below series of videos.
Core Libraries
Matplotlib
Seaborn
Python is the leading programming language and has gained a leading position in solving data science tasks and challenges. So now we will show you some useful and powerful Python libraries used in data science, for example for scientific calculations and other tasks.
All the libraries are open source, so if you find them helpful in your business, you can donate on their official websites.
In very simple terms, a library is a file that consists of some useful code; this code could be a simple function or a collection of functions, variables, and classes.
Now, let us look at the most used python libraries for data science.
Firstly, we start our list with the libraries used in scientific applications. NumPy is one of the top libraries, used for processing large multidimensional arrays and matrices.
It has a collection of high-level mathematical functions and methods.
We can also use NumPy in complex mathematical operations like Fourier transforms, linear algebra, and random number generation. Its array interface also allows the user to reshape datasets.
In the following example, we will be creating a one-dimensional array using numpy.
In the below code, we have performed the following steps:
import numpy as np       # Importing the NumPy package as np
a = np.array([1, 2, 3])  # Creating an array with the array() function
print(a)                 # Printing the array
a.dtype                  # Checking the data type of the array
numpy.genfromtxt function can be used to read files.
In the below code, we have performed the following steps:
student = np.genfromtxt("student-data.csv", delimiter=";", skip_header=1)
Syntax : arange([start,] stop[, step,], dtype=None)
This function takes four parameters, described below:
start: Start of interval. The interval includes the start value. The default start value is 0.
stop: End of interval. The interval does not include the stop value, except in some cases where "step" is not an integer and floating-point round-off affects the length of "out".
step: Spacing between values. For any output "out", this is the distance between two adjacent values, "out[i+1] - out[i]". The default step size is 1. If "step" is specified as a positional argument, "start" must also be given.
dtype: The type of the output array. If "dtype" is not given, the data type is inferred from the other input arguments.
In the below code, we have performed the following steps:
np.arange(3)        # array([0, 1, 2])
np.arange(3.0)      # array([0., 1., 2.])
np.arange(3, 7)     # array([3, 4, 5, 6])
np.arange(3, 7, 2)  # array([3, 5])
Syntax: np.broadcast(in1, in2, ...)
The numpy.broadcast function takes parameters like:
in1, in2, ...: array-like inputs to broadcast against each other.
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
In the below code, we have performed the following steps:
x = np.array([[1], [2], [3]])
y = np.array([4, 5, 6])
b = np.broadcast(x, y)  # iterator broadcasting x against y; b.shape is (3, 3)
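Broadcasting also applies directly to arithmetic; a small sketch of what the shapes above produce:

```python
import numpy as np

x = np.array([[1], [2], [3]])   # shape (3, 1)
y = np.array([4, 5, 6])         # shape (3,)

# The (3, 1) column is stretched against the (3,) row to a (3, 3) result
z = x + y
print(z)
```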
Another important python library for researchers, developers and data scientists which can be used in scientific calculations is Scipy. Scipy is based on the data structures of Numpy and therefore it extends the capabilities of Numpy.
SciPy contains modules for linear algebra, optimization, integration, and statistics. It is built upon NumPy and thus makes substantial use of it.
In the below code, we have performed the following steps:
from scipy import misc
misc.imread('Image_Name.png')
The module for traditional algebra operations is known as scipy.linalg, and you are required to import it before any operation.
To calculate the determinant of a matrix, we use the scipy.linalg.det() function in the following way:
from scipy import linalg
In the below code, we have performed the following steps:
mat = np.array([[2, 1], [4, 3]])  # for a square matrix 'mat'
linalg.det(mat)                   # 2.0
Another function is inv(), which can be used to compute the inverse of a square matrix.
In the below code, we have performed the following steps:
mat = np.array([[2, 1], [4, 3]])  # for a square matrix 'mat'
linalg.inv(mat)
The scipy.special module contains a list of transcendental functions, which are most frequently used in operations across various disciplines. Here is the syntax for a few of the most used functions from the scipy.special module:
# To calculate the area under a Gaussian curve, we use the erf() function:
scipy.special.erf()
# Syntax for the Gamma function:
scipy.special.gamma()
# In order to calculate the log of Gamma, we use the following syntax:
scipy.special.gammaln()
# Jacobian elliptic functions:
scipy.special.ellipj()
# Nth-order Bessel function:
scipy.special.jn()
As the name suggests, statsmodels is a Python library used for statistical computations. This module provides the functions and classes for the estimation of many different statistical models.
It can conduct statistical tests and statistical data exploration. To ensure correctness, the results are tested against existing statistical packages.
To import the stats model library or we can call it module we use below command.
import statsmodels.api as sm
pandas is a data science library used for loading, processing, and analyzing the available data. pandas is designed to work with "labeled" and "relational" data.
pandas is one of the best tools for data wrangling, which is the most important step in data science. There have been a few new releases of the pandas library, including hundreds of new features, enhancements, bug fixes, and API changes. The improvements concern pandas' abilities for grouping and sorting data, more suitable output for the apply method, and support for performing custom-type operations.
If the input file is in CSV format, we can use the pd.read_csv function to read the .csv file.
In the below code, we have performed the following steps:
# Importing the pandas module
import pandas as pd
# Reading a CSV file
pd.read_csv('file_name.csv')
If the data is in Excel form, we can use the pd.read_excel function to read the Excel data.
We can use two functions to read an excel file.
In the below code, we have performed the following steps:
# Importing the pandas module
import pandas as pd
# Reading an Excel file
pd.read_excel('file_name.xlsx')
xlsx = pd.ExcelFile('your_excel_file.xlsx')
In the below code, we have performed the following steps:
import pandas as pd
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
df.shape    # (rows, columns); an attribute, not a method
df.index    # index labels; an attribute
df.columns  # column labels; an attribute
df.count()  # number of non-NA values per column
df.sum()       # sum of values per column
df.cumsum()    # cumulative sum per column
df.min()       # minimum value per column
df.max()       # maximum value per column
df.describe()  # summary statistics
df.mean()      # mean per column
df.median()    # median per column
Sometimes you might want to delete some unwanted columns, so we have functions like:
df.drop('Country', axis=1)
Missing values may harm the analysis performed on a given dataset. To handle them, by dropping a value or one or more columns to clean the missing data, we can use the functions below.
Now let us see the different types of missing values which can be seen and can be handled in data science wrangling modules
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ‘object’ (i.e., arrays of Python objects):
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4])
print(vals1)
Secondly, we have another missing data representation, NaN (an acronym for Not a Number), which is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
import pandas as pd
import numpy as np

vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into the compiled code. You should be aware that NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:
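A small sketch of this “infectious” behavior, along with the NaN-aware aggregates NumPy provides:

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# Any arithmetic involving NaN yields NaN
print(1 + np.nan)       # nan
print(vals.sum())       # nan

# NaN-aware variants ignore the missing value
print(np.nansum(vals))  # 8.0
print(np.nanmax(vals))  # 4.0
```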
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example:
data = pd.Series([1, np.nan, 'hi', None])
data.isnull()
data[data.notnull()]
The isnull() and notnull() methods produce similar Boolean results for DataFrames
In addition to the masking used before, there are the conventional methods, dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward:
data.dropna()
data.fillna(0)
Matplotlib is another popular library for creating graphs that help you take decisions on your data. It is a low-level library for creating two-dimensional diagrams and graphs.
There have been style changes in colors, size, and fonts, etc.
Before we dive into the details of creating visualizations with Matplotlib, there are a few useful things you should know about using the package.
Just as we use the np shorthand for NumPy and the pd shorthand for Pandas, we will use some standard shorthands for Matplotlib imports:
import matplotlib as mpl
import matplotlib.pyplot as plt
We will use the plt.style directive to choose appropriate aesthetic styles for our figures. Here we will set the classic style, which ensures that the plots we create use the classic Matplotlib style:
plt.style.use('classic')
To display your plot, we have the following functions, or methods:
In the below code, we have performed the following steps:
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()
Often you may have to import images while coding or display them; for this you can use the Image object, which is present in the IPython library.
from IPython.display import Image
Image('my_figure.png')
Seaborn is a high-level API based on the matplotlib library. It has a rich gallery of visualizations including some complex types like time series, joint plots, and violin diagrams.
Below image shows you the line plot with the seaborn library.
Before we dive into the details of creating visualizations with seaborn, there are a few useful things you should know about using the seaborn package.
Follow the below code to import the seaborn library:
import matplotlib.pyplot as plt
import seaborn as sns
Again in categorical plots, there are types of plots available as below:
In the below code, we have performed the following steps:
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")
sns.stripplot(x="species", y="petal_length", data=iris)
sns.swarmplot(x="species", y="petal_length", data=iris)
In the below code, we have performed the following steps:
titanic = sns.load_dataset("titanic")
sns.barplot(x="sex", y="survived", hue="class", data=titanic)
In the below code, we have performed the following steps:
titanic = sns.load_dataset("titanic")
sns.countplot(x="deck", data=titanic, palette="Greens_d")
In the below code, we have performed the following steps:
titanic = sns.load_dataset("titanic")
sns.pointplot(x="class", y="survived", hue="sex", data=titanic, palette={"male":"g", "female":"m"}, markers=["^","o"], linestyles=["-","--"])
In the below code, we have performed the following steps:
sns.boxplot(x="alive", y="age", hue="adult_male", data=titanic)
sns.boxplot(data=iris, orient="h")
In the below code, we have performed the following steps:
sns.regplot(x="sepal_width", y="sepal_length", data=iris)
In the below code, we have performed the following steps:
# Plot a univariate distribution
plot = sns.distplot(data.y, kde=False, color="b")
In the below code, we have performed the following steps:
sns.heatmap(uniform_data, vmin=0, vmax=1)
Scikit-learn provides simple and efficient tools for data mining and data analysis. It is accessible to everybody and reusable in various contexts.
Anyone can use this library in their organization because it is open source.
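As a minimal, assumed example (the data below is invented, not from the original post), fitting a linear model with scikit-learn can be as short as:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = 2x + 1 (hypothetical values for illustration)
X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])

model = LinearRegression().fit(X, y)
print(model.predict([[5]]))  # close to [11.]
```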
XGBoost stands for eXtreme Gradient Boosting. This library implements optimized distributed gradient boosting and is designed to be highly efficient, flexible, and portable.
We use this library to train gradient-boosted decision trees and other gradient-boosted models.
In addition, we can integrate XGBoost with Hadoop (YARN), Spark, and AWS (big data tools).
TensorFlow is a great and popular library that will help you develop and train your machine learning models.
TensorFlow is an open-source library for machine learning. It has a flexible and effective ecosystem of tools, libraries, and resources that lets developers build and deploy machine learning applications.
Natural Language Processing (NLP) is the manipulation or understanding of text or speech by a machine or software. In NLP, instead of a human, the computer has the responsibility to interact, understand, and respond with the appropriate answer.
NLTK stands for Natural Language Toolkit. It is one of the most powerful libraries, containing packages that make the machine understand human language and respond with the appropriate answer.
We hope this post was helpful to you to know the most used libraries in data science projects.
Keep visiting our website AcadGild for further updates on data science and other technologies.
The post Top 10 Python Libraries For Data Science in 2019. appeared first on AcadGild.
]]>The post Data Manipulation with Pandas appeared first on AcadGild.
]]>If you are new and want to know about NumPy refer to the below link for a detailed study on NumPy.
https://acadgild.com/blog/data-manipulation
Pandas is a Python package that provides fast, flexible, and expressive data structures designed to work with 1D and 2D data, and that makes data manipulation and analysis easy.
There are the following data structures that Pandas libraries work on:
To begin coding with Pandas we have to first install it. Installation of Pandas requires NumPy to be installed.
Once Pandas is installed, we can import it and check the version:
We can provide an alias name to import pandas:
import pandas as pd
This import convention will be used throughout the coding in this blog.
Let’s deep dive into Series, Dataframe, Missing values and filling the missing values using Pandas.
INTRODUCTION TO PANDAS SERIES OBJECT
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array. This has been explained in the below code:
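The code itself does not appear in the text here; a minimal sketch consistent with the description (the values are invented for illustration) might be:

```python
import numpy as np
import pandas as pd

# A Series built from a plain Python list
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data.values)  # the underlying NumPy array
print(data.index)   # RangeIndex(start=0, stop=4, step=1)
```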
In the above code, we have imported NumPy to access any of its required functions.
Series is the object which we had called using the alias ‘pd’.
In the output, the first column refers to the index and the second column refers to its related values, which we can access with the ‘index’ and ‘values’ attributes respectively as shown by the below code:
Unlike the NumPy Array that has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
That means in Pandas Series the index need not be an Integer value, it can be of any desired data type. Let us see this with the below code:
In the output, the index is of the type ‘String’.
Pandas Series can also be thought of as a Python Dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important as it makes it much more efficient than Python dictionaries for certain operations.
Let us see this with the help of an example:
We can access items dictionary-styled data as follows:
The series also supports array-style operations such as slicing:
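The dictionary-backed Series and the slicing described above can be sketched as follows (the population figures are illustrative assumptions):

```python
import pandas as pd

# Build a Series from a dictionary: keys become the index
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127}
population = pd.Series(population_dict)

# Dictionary-style access
print(population['California'])  # 38332521

# Array-style slicing by label (the endpoint is included)
print(population['California':'Texas'])
```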
INTRODUCTION TO PANDAS DATAFRAME OBJECT
Pandas DataFrame is a two-dimensional data structure, where data is aligned in a tabular fashion in rows and columns.
Creating a DataFrame using List:
Creating DataFrame from dictionary:
To create a DataFrame from a dictionary, all the arrays should be of the same length.
Creating a DataFrame from a Series:
In the previous program that we executed, we will add one more Series, as shown in the code below.
As in the above code, we can see a new Series named state has been created which consists of the states of the 4 cities. Then using DataFrame we have added 2 columns namely Area and State.
In the output shown, the first column can be accessed by using the attribute ‘index’ as shown in the below example:
Likewise, the other columns can be accessed by using the attribute ‘columns’.
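Since the code above was not reproduced in the text, here is an assumed sketch of building a DataFrame from two Series (the city names and values are hypothetical):

```python
import pandas as pd

area = pd.Series({'Delhi': 1484, 'Mumbai': 603,
                  'Kolkata': 205, 'Chennai': 426})
state = pd.Series({'Delhi': 'Delhi', 'Mumbai': 'Maharashtra',
                   'Kolkata': 'West Bengal', 'Chennai': 'Tamil Nadu'})

# Each Series becomes a column; the shared keys become the index
df = pd.DataFrame({'Area': area, 'State': state})
print(df.index)    # the city names
print(df.columns)  # Index(['Area', 'State'], dtype='object')
```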
OPERATIONS ON DATA
Pandas make use of some functions and methods that can be used to combine datasets. These methods include concat, merge and join.
concat(): To concatenate the DataFrames along the row we use the concat() function in pandas. We have to pass the names of the DataFrames in a list as the argument to the concat() function, which is shown in the below example:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
pd.concat([df1, df2, df3])
Output:
In the above code, we have created 3 DataFrames df1, df2 and df3, which we have concatenated using the function concat().
merge(): This function is also used to merge two DataFrames; it looks for one or more matching column names between the two inputs and uses them as the key.
Sometimes this merging is not done so efficiently, therefore this function provides some keywords to handle this, which we will discuss later.
Let us see an example of this:
Using the ‘on’ keyword we can explicitly specify the name of the key column, which takes a column name or a list of column names.
Other keywords are like:
left_on/right_on: When two datasets have the same column under different column names, we can use the left_on and right_on keywords to specify the two column names.
left_index/right_index: When we have to merge two datasets based on their index, we can use left_index/right_index.
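A hedged sketch of merge() with the ‘on’ keyword (the employee data is invented for illustration):

```python
import pandas as pd

df_a = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                     'group': ['Accounting', 'Engineering', 'HR']})
df_b = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                     'hire_date': [2008, 2012, 2014]})

# 'employee' is the matching column used as the key
merged = pd.merge(df_a, df_b, on='employee')
print(merged)
```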
HANDLING MISSING DATA
Let us discuss how to handle missing data but before that let us understand what missing data is.
Missing data occurs when no value is provided for one or more items in a dataset. Missing data in Pandas is represented by NaN (Not a Number) or None.
Let us see how missing data occurs:
In the above program, we created an array of dimension 5x3 and later reindexed it to 8 rows, where data for some of the indices goes missing, so we get missing values as NaN. For re-indexing, we used the reindex method, which changes the row labels and column labels of a DataFrame.
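A sketch of how reindexing introduces NaN, along the lines described above (the labels are assumed):

```python
import numpy as np
import pandas as pd

# 5x3 array of random values, reindexed to 8 rows
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two', 'three'])
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df2)  # rows 'f', 'g', 'h' are all NaN
```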
CHECK FOR MISSING VALUES
We can check for missing data by using isnull() and notnull() functions:
In the above program, the attribute isnull() checks for null value and wherever the values are missing, it returns true.
FILLING MISSING VALUES
We can fill in the missing values by using functions like fillna():
In the above program, we used the function fillna(), passing 1 as the argument. Hence the value ‘1’ gets filled in place of the missing values.
DROPPING MISSING VALUES
We can drop null values from a DataFrame using dropna() function. By default, this function works along rows.
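A small, self-contained sketch combining the checking, filling, and dropping steps described above (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, np.nan]})

print(df.isnull())   # True wherever a value is missing
print(df.fillna(1))  # missing values replaced by 1
print(df.dropna())   # rows containing any NaN are removed
```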
We hope this post has been helpful in understanding the working of Pandas, its various operations and some other concepts which has been explained with the help of codes and the output.
In the future, you can expect more blogs on Python libraries; until then, keep visiting our website Acadgild for more updates on Data Science and other technologies.
The post Data Manipulation with Pandas appeared first on AcadGild.
]]>The post Install Nifi On Linux appeared first on AcadGild.
]]>As we know data is stored on different machines, databases, and other sources.
Often, the user has to write APIs in different languages in order to collect or move data from source to destination. A user who is not good at coding can use NiFi, a drag-and-drop, GUI-based data flow framework that allows the user to connect multiple sources and lets data flow to the destination with little or no programming.
Apache NiFi is a data flow management system that comes with a built-in web UI, providing an easy way to handle data flows in real time. The most important concept to understand for a quick start with NiFi is flow-based programming. In plain terms, you create a series of nodes connected by a series of edges, forming a graph that data moves through. In NiFi, these nodes are processors, the edges are connectors, and the data is stored within a packet of information known as a flow file. A flow file has things like content, attributes, and age.
Download and Install NiFi in Linux system.
Download:
We can visit the official website of Apache Nifi and download the NiFi.
Now, select the Downloads button, then select nifi-1.3.0-bin.zip
Select the below link and download the file
http://apache.mirror.serversaustralia.com.au/nifi/1.3.0/nifi-toolkit-1.3.0-bin.zip
Once the file is downloaded unzip the nifi zip file
nifi.properties is the file where we can edit the web URL port number that the user will use to create workflows.
Next, create a copy of the nifi.properties file in the conf directory, in case the original nifi.properties file gets edited and corrupted.
You can do it by below command
cd nifi-1.3.0
cd conf
# Copy the nifi.properties file
cp nifi.properties nifi.properties.old
Now we can work with the original nifi.properties file; if it gets corrupted, we can use the copied file to view and restore the original nifi.properties file.
Open the nifi.properties file using gedit
gedit nifi.properties
We can see the web port number in the web properties block. Using this port number, the user can work on NiFi.
# web properties #
nifi.web.http.port=8080
We can change the above port number if we do not want to use it.
Example:
nifi.web.http.port = 9999
Save and close the nifi.properties file
Start NiFi
To start NiFi, the user should go to the NiFi bin directory and type the below command:
./nifi.sh start
NOTE: If we want to know the status of the service, we can follow the below path in the log directory.
cd logs
tail -f nifi-app.log
This log references the web server, or application, that runs NiFi in the background.
Now you can see the message stating nifi is successfully started. And the UI for nifi is available at the following URLs:
localhost:8080/nifi
We can use the above URLs in the Linux operating system to start using nifi.
From the above image, we can see nifi is successfully installed and the empty campus where we can add processes.
NiFi Working Example:
In the next example, we will be creating a workflow where random files are created from the source and stored in a specific location.
To create workflow the user should use Add processor component which is available in the Components Toolbar of NiFi.
Click on the Add processor component button, then drag and drop this button onto the canvas.
Right click on the drag and dropped Add processor component and select the type Generateflowfile.
GenerateFlowFile is a processor type which is used to create flow files with random data.
Once the GenerateFlowFile processor is created, create a new processor again by selecting the processor component, then drag and drop the processor component into the canvas.
Now, right-click on the new processor, choose configure, and select the PutFile processor type.
PutFile is a processor type which is used to write the contents of a flow file into the local file system.
Now, right-click on the PutFile processor and select the configure option.
In the Configure Processor window, select the Automatically Terminate Relationships checkboxes for failure and success.
Now, go to properties tab of configuring processor window
– In the property select Directory and in the value enter the path where the randomly generated files will be stored.
/home/acadgid/Desktop/NiFi/StorageFolder
– In the Conflict Resolution Strategy set the value replace.
– In the Create Missing Directories set the value as true.
– Now, click on the Apply button to save the changes made.
Now link both the GenerateFlowFile and PutFile processors in order to create a complete workflow.
In the below image we can see workflow to generate the random file and store these randomly generated files in the local file system is created successfully.
Now right click on GenerateFlowFile processor and select option start to generate random flow files.
And then right-click on the PutFile processor and select the start option to collect and store the random flow files generated by the GenerateFlowFile processor.
After a few seconds, we can see random files are generated and stored in the specified path.
Right-click on both processors and select the Stop option to stop the currently running processes.
Now go to the StorageFolder and see the result where you can see files which got generated and stored in the specified destination folder /home/acadgid/Desktop/NiFi/StorageFolder
cd StorageFolder
ls
To check the size of the present working directory use the below command.
du -sh StorageFolder
To see the number of files generated in the destination directory, use the below command.
ls -l | wc -l
From the above example, we can see we have successfully created a workflow where random files are generated and stored in the specified location.
We hope this post has been helpful in understanding the working of NiFi. In the future, you can expect more blogs on NiFi; until then, keep visiting our website Acadgild for more updates on Big Data and other technologies.
The post Install Nifi On Linux appeared first on AcadGild.
]]>The post Data Manipulation with NumPy appeared first on AcadGild.
]]>Refer to the below blog link to have a better understanding of NumPy basics:
https://acadgild.com/blog/data-manipulation
Now, let us first understand Universal Function.
A Universal Function is a function in NumPy that operates on nd-array, and which supports array broadcasting, type-casting, and other standard features. This means a ufunc(Universal function) supports vectorized operation, that can be accomplished by performing operations on the array, which will then be applied to each element.
Features:
Use of Ufunc: Computations in NumPy can either be very fast or can be very slow. To make operations fast, they are generally implemented through NumPy’s Universal Functions.
Starting with the Arithmetic Functions such as addition, subtraction, multiplication and division of array, we will follow the following steps:
Let us see this with the help of an example
import numpy as np

a = np.array([2, 4, 8, 1])
# adding 2 to every element
print("Adding 2 to every element:", a + 2)
# subtracting 1 from each element
print("Subtracting 1 from each element:", a - 1)
# multiplying each element by 10
print("Multiplying each element by 10:", a * 10)
We can also modify an existing array by performing some more operations on it such as:
To square each element of an array:
print ("Squaring each element:", a**2)
To print the transpose of a Matrix:
a = np.array([[5, 6, 7], [8, 9, 10], [21, 22, 23]])
print("Original Matrix \n", a)
print("Transpose of Matrix a \n", a.T)
To fetch the max and min element of an array
x = np.array([[5, 8, 11], [4, 1, 9], [10, 12, 19]])
print("Largest element is:", x.max())
print("Row-wise maximum elements:", x.max(axis=1))
print("Smallest element is:", x.min())
print("Column-wise smallest elements:", x.min(axis=0))
In the above code, the first statement is to print the highest/smallest element of the array named x and the second statement is to print the highest/smallest element row-wise/column-wise.
Here we made use of axis, where axis = 0 means column wise and axis = 1 means row-wise.
To get the sum of an array
print ("Sum of all array elements:", x.sum())
Likewise, we can find the cumulative sum along each row
print ("Cumulative sum along each row:\n", x.cumsum(axis = 1))
Until now we have learned about some of the unary operators and functions. Now we will read about some of the binary operators and functions.
Binary operators are the ones where two operands are used.
Here we will perform addition, subtraction, multiplication of 2 arrays/matrices.
To perform addition, subtraction, and multiplication on 2 arrays
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print("Sum of 2 arrays \n", a + b)
print("Difference of 2 arrays \n", a - b)
To perform array multiplication and matrix multiplication:
# multiply arrays
print("Array multiplication:\n", a * b)
# matrix multiplication
print("Matrix multiplication:\n", a.dot(b))
From the above example, we can see that two operations were performed.
Array multiplication i.e, element-wise multiplication in which the first element of array ‘a’ gets multiplied to the first element of array ‘b’, the second element of array ‘a’ gets multiplied to the second element of array ‘b’ and so on.
And Matrix multiplication i.e., each row of array ‘a’ gets multiplied to each column of array ‘b’. Here we made use of the function dot().
These were the usage of some of the operators and functions.
Some of the others functions are listed below
add(), subtract(), negative(), multiply(), divide(), floor_divide(), power(), mod(), abs() etc.
Consider the below code to understand the concept of sin, cos, tan:
arr = np.array([0, 30, 60, 90, 120, 150, 180])
x = arr * np.pi / 180
print("\nThe sin value of the angles\n")
print(np.sin(x))
print("\nThe cosine value of the angles\n")
print(np.cos(x))
print("\nThe tangent value of the angles\n")
print(np.tan(x))
Exponential: Another common type of operation available in NumPy ufuncs is the exponential.
Consider the following example:
x = [3, 4, 5]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
Logarithmic: The inverse of exponentials i.e, the logarithms, are also available. The function log() gives the natural logarithm
x = [2, 4, 6, 8]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
Sorting Arrays
Sorting means arranging data in a systematic manner.
Until now we have seen examples of various operations being performed on an array. In this section, we will see how sorting of arrays is being done.
NumPy’s sort() function is the most efficient and useful function which could serve our above purpose.
Let us see the working with the help of an example
x = np.array([2, 41, 14, 53, 25])
np.sort(x)
NumPy’s argsort() function is one more function which returns the indices of the sorted elements.
x = np.array([2, 41, 14, 53, 25])
i = np.argsort(x)
print(i)
In the result shown above the first element gives the index of the smallest array element, the second element gives the index of the second smallest array element and so on.
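The indices returned by argsort() can be used directly with fancy indexing to produce the sorted array:

```python
import numpy as np

x = np.array([2, 41, 14, 53, 25])
i = np.argsort(x)
print(i)     # [0 2 4 1 3]
print(x[i])  # [ 2 14 25 41 53] -- the same as np.sort(x)
```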
Introduction to Broadcasting
NumPy provides a powerful mechanism called broadcasting. It is simply a set of rules for applying binary ufuncs (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes.
A = np.array([[11, 12, 13],
              [21, 22, 23],
              [31, 32, 33]])
B = np.array([1, 2, 3])
print("Addition with broadcasting: ")
print(A + B)
print("Multiplication with broadcasting: ")
print(A * B)
In the above code each element of array A is added to each element of array B row-wise. Likewise, for multiplication, each element of array A is multiplied to each element of array B row-wise.
Introduction to Fancy Indexing
When we have to access multiple elements at once we make use of Fancy Indexing. Fancy Indexing is conceptually simple, it means passing an array of indices to access multiple elements.
Let us see an example to understand this:
Consider the following array:
rand = np.random
x = rand.randint(100, size=10)
print(x)
If we want to access 3 different elements, we can do it as
[x[3], x[7], x[9]]
Or we can pass a single list of array of indices to obtain the same result
ind = [3, 7, 9]
x[ind]
Fancy Indexing also works on the multi-dimensions array.
Consider the following array:
X = np.arange(9).reshape((3, 3))
X
To get the element from the array, we will make use of 2 parameters, the first index refers to the row, and the second to the column:
row = np.array([0, 1, 2])
col = np.array([2, 1, 1])
X[row, col]
As from the above example, we can see that the first element [2] is the result of the index [0, 2] that has been taken one from each variable.
For Example: Let us see how the above operations are performed on arrays and matrices with the help of some use cases
Problem Statement 1: Create 3*3 matrices of random single-digit numbers and perform arithmetic operations on it such as addition, subtraction, multiplication, division, pow, etc.
sol:
output:
In the above program, we performed arithmetic operations such as addition, multiplication, division, and square root using the pow function.
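The solution code above was not reproduced in the text; a sketch of what it might look like (the exact operations are assumed from the description):

```python
import numpy as np

# 3x3 matrices of random single-digit numbers
m1 = np.random.randint(1, 10, size=(3, 3))
m2 = np.random.randint(1, 10, size=(3, 3))

print("Addition:\n", m1 + m2)
print("Subtraction:\n", m1 - m2)
print("Multiplication:\n", m1 * m2)
print("Division:\n", m1 / m2)
print("Power:\n", np.power(m1, 2))
```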
Problem Statement 2: Create 3*3 matrices of random single-digit numbers and apply universal functions on them such as square root, exponential, log10, modulo, etc.
sol:
Output:
In the above program, we performed operations such as: the square root of matrix 1 using the function sqrt(); the exponential of all the elements of matrix 1 (the exponential function is e^x, where e is a mathematical constant called Euler’s number, approximately 2.718281, and x is an element of matrix 1); the logarithmic value of the elements of matrix 1; and the modulo remainder, i.e., the remainder on dividing matrix 1 by matrix 2.
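Since the solution code was not reproduced here, a sketch of the ufunc operations described might be:

```python
import numpy as np

# 3x3 matrices of random single-digit numbers
m1 = np.random.randint(1, 10, size=(3, 3))
m2 = np.random.randint(1, 10, size=(3, 3))

print("Square root:\n", np.sqrt(m1))
print("Exponential:\n", np.exp(m1))
print("Log10:\n", np.log10(m1))
print("Modulo:\n", np.mod(m1, m2))
```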
Broadcasting
NumPy’s broadcasting rule relaxes this constraint when the arrays’ shapes meet certain constraints. When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward. Two dimensions are compatible when they are equal, or one of them is 1.
Problem Statement 1: Create 4*3 matrices of all zeroes and perform operations on it as given.
1.1: create a rank 1 ndarray with 3 values and add it to each row of ‘arr’ using broadcasting
sol:
1.2: create an ndarray which is 4 x 1 to broadcast across columns and add it to each column of ‘arr’ using broadcasting
sol:
In the above program, we created an array of dimension 4*3 with all zeroes and perform specific operations on it.
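Both broadcasting steps above can be sketched as follows (the added values are invented for illustration):

```python
import numpy as np

# 4x3 matrix of all zeroes
arr = np.zeros((4, 3))

# 1.1: rank-1 ndarray with 3 values, broadcast across each row
row = np.array([1, 2, 3])
print(arr + row)

# 1.2: 4x1 ndarray, broadcast across each column
col = np.array([[10], [20], [30], [40]])
print(arr + col)
```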
Hope this post is helpful in understanding the working of NumPy and its various operations explained with the help of use cases.
In future, you can expect more blogs on Python libraries, until then Keep visiting our website Acadgild for more updates on Data Science and other technologies.
The post Data Manipulation with NumPy appeared first on AcadGild.
]]>The post Linear Regression Model Building appeared first on AcadGild.
]]>In this blog, we will be discussing how to use a linear regression model to find and build a prediction model.
Here we will be using the Airquality data set which is available in R to build a linear regression prediction model.
Before going further, first of all, understand what is linear regression and its significance.
Linear regression establishes a relationship between a dependent variable, Y, and independent variables, X, using a best-fit straight line known as the regression line. The goodness of fit is generally denoted by R^2.
Let us see the syntax of the linear model :
When we use the lm() function, we specify the dataframe using the data = parameter.
df = dataframe that contains variables.
target ~ predictor syntax is basically telling the lm() function what is the “target” variable which we want to predict and what our “predictor” variable is – the x variable that we’re using as an input for the prediction.
Y = mX + c where m= slope of straight line and c= Y-intercept
Or y=b0+b1x where, y = Predicted value, b0= Intercept, b1= Slope, x = Predictor
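For reference, the standard least squares estimates behind this line (standard formulas, consistent with the y = b0 + b1x notation above, not taken from the original post) are:

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

Here x-bar and y-bar are the sample means of the predictor and the target.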
Here the dependent variable is the target variable, which is continuous, and the independent variables are the predictors, which can be continuous or discrete.
airquality is a standard built-in data set that makes it convenient to work on linear regression. You can access this data set by typing airquality in your R console. You will find that it consists of 153 observations (rows) and 6 variables (columns): Ozone, Solar.R, Wind, Temp, Month, and Day.
Load the data set in R and process it; the code flow is given below:
The command View(airquality) displays the data set in your R environment.
We can use the below code to check air quality data set from R console.
In the above command, we have used the str() command, which shows that it is a data frame and that 153 observations of 6 variables are present.
We can also check with the head() command; it returns the first 6 records by default.
Let’s process the data set.
We can see the summary of the data set which shows the NA values or missing values using the summary() command.
Now let’s give input monthly mean in Ozone and Solar.R to replace missing values with Mean.
In the above code, 1:nrow iterates from the first to the last row of the data set; we check for missing values with the is.na command. We pass the argument na.rm=TRUE, and each missing value is replaced by the monthly mean, computed with the mean() command, for Ozone and Solar.R.
We can see in the above console that there are no NA or missing values left. This is a very important step when dealing with data cleaning.
We will discuss more data cleaning/data wrangling process in the upcoming blogs.
In the code below, we can see that normalization rescales the values into the range [0,1]; this is also called min-max scaling.
We can see in the console that Normalization transforms the data into a range between 0 and 1 and there are no outliers or missing values left in the data set.
Now apply the Linear regression algorithm using the Least Squares Method on “Ozone” and “Solar.R”
In the below code we select the target attribute Y i.e Ozone and Predictor attribute X i.e Solar.R to build the model_1 and check the correlation between X and Y with lm() function.
We observe that model_1 provides the regression line coefficient that is slope and Y – intercept.
The above graph shows the scatter plot between X and Y.
Here we are adding a regression line to scatter plot to see the relationship between X and Y.
The slope of the line goes upward, hence there exists a positive correlation between Ozone and Solar.R.
Now, if we increase the value of X, the value of Y will also increase, and vice versa.
The above graph shows the regression line between X and Y, and the positive correlation between the X and Y attributes.
Now apply the linear regression algorithm, again using the least-squares method, on “Ozone” and “Wind”.
In the code below, we select the target attribute Y (Ozone) and the predictor attribute X (Wind), build model_2, and examine the relationship between X and Y with the lm() function.
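A corresponding sketch for the second model, assuming Wind as the predictor:

```r
data(airquality)

# Fit Ozone (Y) against Wind (X); the slope comes out negative,
# consistent with the downward-sloping line described below
model_2 <- lm(Ozone ~ Wind, data = airquality)
model_2

# The correlation between X and Y (negative in this case)
cor(airquality$Wind, airquality$Ozone, use = "complete.obs")
```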
We can see that model_2 provides the regression line coefficients, that is, the slope and the Y-intercept.
The above graph shows the scatter plot between X and Y. Here we add a regression line to the scatter plot to see the relationship between X and Y.
The line slopes downward, so there is a negative correlation between Ozone and Wind: as X increases, Y tends to decrease.
The above graph shows the regression line between X and Y, and the negative correlation between the X and Y attributes.
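The prediction step can be sketched with predict(); note that the exact figures quoted below depend on how the data were cleaned and scaled before fitting:

```r
data(airquality)
model_1 <- lm(Ozone ~ Solar.R, data = airquality)
model_2 <- lm(Ozone ~ Wind, data = airquality)

# Predict the Ozone level when solar radiation is 10
predict(model_1, newdata = data.frame(Solar.R = 10))

# Predict the Ozone level when wind is 5
predict(model_2, newdata = data.frame(Wind = 5))
```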
Hence the predicted Ozone level is 1.049993 when solar radiation is 10.
Hence the predicted Ozone level is -21.46849 when the wind is 5.
We hope this example helped you understand linear regression model building with the air quality data set in R.
You can refer to the link https://acadgild.com/blog/55690-2 to learn about mean, median, and mode using R.
Keep visiting our site www.acadgild.com for more updates on Data Analytics and other technologies.
The post Linear Regression Model Building appeared first on AcadGild.
Apache Spark offers the following advantages:
1) Inbuilt machine learning libraries.
2) Efficiency in interactive queries and iterative algorithms.
3) Highly reliable, fast in-memory computation.
4) A processing platform for streaming data, via Spark Streaming.
5) Fault-tolerance capabilities, thanks to the immutable primary abstraction named RDD.
6) High efficiency in real-time analytics using Spark Streaming and Spark SQL.
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
Here we launch the Spark shell to write our application code. We use the spark-csv package from Databricks, which makes it easy to read data from CSV files.
import org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("delimiter",",").load("/home/acadgild/Desktop/Soccer_Data_Set.docx.csv")
data.registerTempTable("olympics")
First of all, as you can see above, we have created a data frame from the CSV file. We have also registered a temporary table from the data frame so that we can execute the queries we want.
// Query 1: bronze-medal count per country in Football
val result1 = sqlContext.sql("select Country,count(Medal) as Medal from olympics where Sport=='Football' and Medal=='Bronze' group by Country").show()

// Query 2: medal count per sport for the USA
val result2 = sqlContext.sql("select Count(Medal) as Medal,country,sport from olympics where Country=='USA' group by Sport,country").show()

// Query 3: medal count per country, grouped by medal type
val result3 = sqlContext.sql("select Country,Medal,Count(Medal) as Count from olympics group by Medal,country").show()

// Query 4: silver-medal count per year for Mexico (MEX)
val result4 = sqlContext.sql("select Country,Medal,Count(Medal) as Count,year from olympics where Medal='Silver' and Country='MEX' group by Medal,country,year").show(false)
We hope the above blog helped you understand soccer data analysis using Spark SQL. Keep visiting our site for more updates on Big Data and other technologies. Click here to learn Scala, the language used in Spark.
The post Soccer Data Analysis Using Apache Spark SQL (Use Case) appeared first on AcadGild.