Big Data Hadoop & Spark

Analyzing New York Crime Data Using SparkSQL

In this post, we will be analyzing the crimes dataset of New York using SparkSQL. In case you are not familiar with SparkSQL, please refer to our post on Introduction to SparkSQL.

Dataset Description:

This dataset is available publically, reflects the reported incidents of crime (with the exception of murders, where data exists for each victim) that has occurred in the City of Chicago from 2001 to present. The data is extracted from the New York Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

You can download the dataset from here.

Below is a sample record from the dataset

You can click here for the complete data set column wise description.

In this post, we will be using pyspark shell for writing our queries.

Problem Statement:

  1. Find number of crimes that happened under each FBI code.

Code:

#import SQLContext and row
from pyspark.sql import SQLContext,Row
sqlContext=SQLContext(sc)
#load the data set and split the records
lines =sc.textFile("hdfs://localhost:9000/Crime_dataset")
parts = lines.map(lambda l: l.split(","))
# construct the Rows by by passing a list of key/value pairs as kwargs
Crimes = parts.map(lambda p:Row(Id =p[0],case_no=p[1],date=p[2],block=p[3],IUCR=p[4],Primary_type=p[5],description=p[6],Loc_des =p[7],arrest=p[8],domestic= p[9],beat=p[10],district=p[11],ward=p[12],community=p[13],fbicode=p[14],XCor=p[15],YCor=p[16],year=p[17],Updated_on=p[18],lattitude=p[19],longi=p[20],loc=p[21]))
# Create the DataFrame and register it has Table
schema1=sqlContext.createDataFrame(Crimes)
schema1.registerTempTable("Crimes")
#run the query for getting the required result
result=sqlContext.sql("select fbicode,count(fbicode) as count from Crimes group by fbicode")
result.show()

Output:

2. Find number of ‘NARCOTICS’ cases filed in the year 2015.

Hadoop

We have already read the data created from the Data Frame and registered as a table with the name ‘Çrimes’, in the first problem statement. Now, we can directly run the queries on this table.

Query:

result=sqlContext.sql("select count(*) as count from Crimes where Primary_type ='NARCOTICS' and year = 2015 ")
result.show()

3. Find the number of theft related arrests that happened in each district.

result=sqlContext.sql(“select district ,count(*) as count from Crimes where Primary_type =’THEFT’ and arrest = ‘true’ group by district “) result.show()

We hope this blog helped you in getting grip over SparkSQL concepts.Keep visiting our website for more blogs on Big Data, Spark and other technologies.

Spark

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Articles

Close