In this instructional post, we will discuss the spark SQL use case Hospital Charges Data Analysis in the United States. You can find below a description of the dataset.
In this Spark SQL use case, we will be performing all the kinds of analysis and processing of the data using Spark SQL.
DRG Definition: The code and description identifying the MS-DRG. MS-DRGs are a classification system that groups similar clinical conditions (diagnoses) and procedures furnished by the hospital during their stay.
Provider Id: The CMS Certification Number (CCN) assigned to the Medicare-certified hospital facility.
Provider Name: The name of the provider.
Provider Street Address: The provider’s street address.
Provider City: The city where the provider is located.
Provider State: The state where the provider is located.
Provider Zip Code: The provider’s zip code.
Provider HRR: The Hospital Referral Region (HRR) where the provider is located.
Total Discharges: The number of discharges billed by the provider for inpatient hospital services.
Average Covered Charges: The provider’s average charge for services covered by Medicare for all discharges in the MS-DRG. These will vary from hospital to hospital because of the differences in hospital charge structures.
Average Total Payments: The average total payments to all providers for the MS-DRG including the MSDRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Also included in the average total payments are co-payment and deductible amounts that the patient is responsible for and any additional payments by third parties for coordination of benefits.
Average Medicare Payments: The average amount that Medicare pays to the provider for Medicare’s share of the MS-DRG. Average Medicare payment amounts include the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Medicare payments DO NOT include beneficiary co-payments and deductible amounts nor any additional payments from third parties for coordination of benefits.
*Note: In general, Medicare FFS claims with dates-of-service or dates-of-discharge on or after April 1, 2013, incurred a 2 percent reduction in Medicare payment. This is in response to mandatory across-the-board reductions in Federal spending, also known as sequestration.
You can download the dataset used in this spark SQL use case from here.
The data set used in this Spark SQL Use Case consists of 163065 records. Using this data, we will be evaluating a few problem statements using Spark SQL.
*Note: In this Spark SQL Use Case, we are using Spark-2.0.
In Spark-2.0, we can load a CSV file directly into the Spark SQL context as follows:
val session = org.apache.spark.sql.SparkSession.builder.master("local").appName("Spark CSV Reader").getOrCreate; val df = session.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/kiran/Documents/datasets/inpatientCharges.csv")
With this, we have loaded all the CSV data as a DataFrame into Spark SQL. Here, we have used inferschema as an option so it will automatically infer the data type of the columns. We can also check the schema of the DataFrame as follows:
root |-- DRGDefinition: string (nullable = true) |-- ProviderId: integer (nullable = true) |-- ProviderName: string (nullable = true) |-- ProviderStreetAddress: string (nullable = true) |-- ProviderCity: string (nullable = true) |-- ProviderState: string (nullable = true) |-- ProviderZipCode: integer (nullable = true) |-- HospitalReferralRegionDescription: string (nullable = true) |-- TotalDischarges: integer (nullable = true) |-- AverageCoveredCharges: double (nullable = true) |-- AverageTotalPayments: double (nullable = true) |-- AverageMedicarePayments: double (nullable = true)
You can see that the schema is the same as our data description.
To see the contents inside the DataFrame, type the following:
Next, we will save the data in a table by registering it in a temp table as shown below.
Problem Statement 1: Find the amount of Average Covered Charges per state.
It gets easier with Spark SQL. One only requires a line of code to evaluate this problem statement.
You can see the average amount of AverageCoveredCharges per state as your output. The same is shown in the screen shot below.
Similarly, we can find the Average Total Payments & Average Medicare Payments per state as follows.
Problem Statement 2: Find the amount of Average Total Payments charges per state.
Here is the query to find out the AverageTotalPayments charges per state.
Problem Statement 3: Find the amount of Average Medicare Payments charges per state.
Here is the query to find out the AverageMedicarePayments charges per state.
Problem Statement 4: Find out the total number of Discharges per state and for each disease.
It is easier to find with Spark SQL.
Here is the one line of code that helps evaluate the above problem statement.
This is shown in the following screenshot:
With Spark SQL, sorting of these records is also very easy. You can use either sort or orderBy functions to sort the output. By default, these functions sort the output in an ascending order, you can also modify it by passing the function desc into the sort or orderBy functions.
In the above screenshot, you can see that the records have been sorted in a descending order by the column sum (“TotalDischarges”).
We hope this instructional post of Spark SQL use case helped you in understanding how to perform interactive SQL queries using Spark SQL.
You can also refer to our other Spark SQL Use Case from the below links
Spark SQL Use Case – Machine and Sensor Data Analysis
Keep visiting our website www.acadgild.com for more updates on courses and other technologies.