In this post, we will analyze the Chicago crimes dataset using SparkSQL. If you are not familiar with SparkSQL, please refer to our post on Introduction to SparkSQL.
This dataset is publicly available and reflects the reported incidents of crime (with the exception of murders, where data exists for each victim) that have occurred in the City of Chicago from 2001 to the present. The data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.
You can download the dataset from here.
Below is a sample record from the dataset.
You can click here for the complete column-wise description of the dataset.
In this post, we will be using pyspark shell for writing our queries.
1. Find the number of crimes that happened under each FBI code.
# import SQLContext and Row
from pyspark.sql import SQLContext, Row
# load the dataset and split each record into fields
lines = sc.textFile("Crimes_2001_to_present.csv")  # path to the downloaded CSV file (adjust as needed)
parts = lines.map(lambda l: l.split(","))
# construct the Rows by passing a list of key/value pairs as kwargs
Crimes = parts.map(lambda p: Row(Id=p[0], case_no=p[1], date=p[2], block=p[3], IUCR=p[4], Primary_type=p[5], description=p[6], Loc_des=p[7], arrest=p[8], domestic=p[9], beat=p[10], district=p[11], ward=p[12], community=p[13], fbicode=p[14], XCor=p[15], YCor=p[16], year=p[17], Updated_on=p[18], latitude=p[19], longi=p[20], loc=p[21]))
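One caveat with the split above: a plain `split(",")` breaks on fields that themselves contain commas, such as a quoted "(latitude, longitude)" location column. Python's `csv` module handles the quoting correctly; here is a minimal sketch on a made-up record (the values are hypothetical, not taken from the dataset):

```python
import csv
from io import StringIO

# a hypothetical record whose last field contains a comma inside quotes
line = '10000092,HY189866,03/18/2015,"(41.89, -87.74)"'

naive = line.split(",")                     # splits inside the quoted field
parsed = next(csv.reader(StringIO(line)))   # respects the quoting

print(len(naive))   # 5 -- the quoted location field was split in two
print(len(parsed))  # 4 -- the quoted location field stays intact
```

Inside a Spark map you could apply the same idea per line, e.g. `lines.map(lambda l: next(csv.reader(StringIO(l))))`.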
# create the DataFrame and register it as a table
sqlContext = SQLContext(sc)
crimes_df = sqlContext.createDataFrame(Crimes)
crimes_df.registerTempTable("Crimes")
# run the query to get the required result
result = sqlContext.sql("select fbicode, count(fbicode) as count from Crimes group by fbicode")
result.show()
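The GROUP BY above simply counts rows per FBI code. On a toy sample, the same aggregation can be sketched in plain Python (the values below are made up for illustration):

```python
from collections import Counter

# hypothetical fbicode values from a handful of records
fbicodes = ["18", "06", "18", "08B", "06", "18"]

# equivalent of: select fbicode, count(fbicode) ... group by fbicode
counts = Counter(fbicodes)

print(counts["18"])   # 3
print(counts["06"])   # 2
print(counts["08B"])  # 1
```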
2. Find number of ‘NARCOTICS’ cases filed in the year 2015.
In the first problem statement, we already created the DataFrame and registered it as a table named 'Crimes', so we can run queries directly against that table.
result = sqlContext.sql("select count(*) as count from Crimes where Primary_type = 'NARCOTICS' and year = 2015")
result.show()
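Here the WHERE clause keeps only the rows matching both predicates before counting. A minimal pure-Python sketch of the same logic, on hypothetical (Primary_type, year) pairs:

```python
# hypothetical (Primary_type, year) pairs from a handful of records
rows = [("NARCOTICS", 2015), ("THEFT", 2015), ("NARCOTICS", 2014), ("NARCOTICS", 2015)]

# equivalent of: select count(*) where Primary_type = 'NARCOTICS' and year = 2015
count = sum(1 for ptype, year in rows if ptype == "NARCOTICS" and year == 2015)

print(count)  # 2
```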
3. Find the number of theft-related arrests that happened in each district.
result = sqlContext.sql("select district, count(*) as count from Crimes where Primary_type = 'THEFT' and arrest = 'true' group by district")
result.show()
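This last query combines the two ideas: filter first (WHERE), then count per group (GROUP BY). The same combination, sketched on hypothetical (Primary_type, arrest, district) triples:

```python
from collections import Counter

# hypothetical (Primary_type, arrest, district) triples
rows = [
    ("THEFT", "true", "001"),
    ("THEFT", "false", "001"),
    ("BATTERY", "true", "002"),
    ("THEFT", "true", "002"),
    ("THEFT", "true", "001"),
]

# equivalent of: where Primary_type = 'THEFT' and arrest = 'true' group by district
per_district = Counter(d for p, a, d in rows if p == "THEFT" and a == "true")

print(per_district["001"])  # 2
print(per_district["002"])  # 1
```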