HealthCare Use Case With Apache Spark

Apache Spark supports a variety of data-processing use cases, such as ETL (Extract, Transform and Load), analysis (both interactive and batch), and streaming.
In this blog, we will explore how to use Spark for ETL and descriptive analysis. We will use patient data sets to compute a statistical summary of a data sample.

How can Spark help healthcare?

A number of use cases in healthcare institutions are well suited to a big data solution. Some academic and research-oriented healthcare institutions are either experimenting with big data or using it in advanced research projects. The healthcare industry generates a large volume of data. Electronic Medical Records (EMRs) alone collect a huge amount of data, and there are various other sources of data in the healthcare industry as well.
Over the last decade, pharmaceutical companies have been aggregating years of research and development data into medical databases, and patient records have been digitized. In parallel, recent technical advances have made it easier to collect and analyze information from multiple sources. This is a major benefit for healthcare institutions, since data for a single patient may come from various hospitals, laboratories, and physician offices.
With so much medical data coming from various sources, insights gained from big data through machine learning algorithms can guide decisions. Traditionally, physicians have relied on their own judgment when making treatment decisions, but in the last few years there has been a shift towards evidence-based medicine. This involves systematically reviewing clinical data and making treatment decisions based on the best available information. Aggregating individual data sets into big-data algorithms often provides the most robust evidence, since nuances in subpopulations (such as the presence of patients with gluten allergies) may be so rare that they are not readily apparent in small samples.

About the data

It is difficult and expensive to access Electronic Medical Records (EMRs) due to privacy concerns and technical problems. In healthcare, HIPAA compliance is non-negotiable. Nothing is more important than the privacy and security of patient data.
Hence, to overcome this problem, the data used here is machine-generated as per pre-defined criteria. The database has the same characteristics as an actual medical database, such as patients’ admission details, demographics, socioeconomic details, labs, medications, etc.
The database records and features are customizable. The generated data set is around 2 GB of simulated EMR data.
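
To make the idea of machine-generated records concrete, here is a minimal sketch of how a simulated patients file might be produced. The generator actually used for this blog is not shown, so everything here is an assumption made for illustration: the function names `generate_patient` and `generate_patients_csv`, the short city list, the upper bound on the date of birth, and the YYYY-MM-DD date format. Only the category values and the earliest date of birth come from the data described in this blog.

```python
import csv
import io
import random
from datetime import date, timedelta

# Category values as they appear in the distributions reported in this blog.
GENDERS = ["F", "M"]
MARITAL_STATUSES = ["Divorced", "Single", "Married"]
SMOKING_STATUSES = ["Frequently", "No", "Once", "Occasionally"]
# Illustrative subset of cities; a real generator would use a longer list.
CITIES = ["Sikar", "Adityapur", "Mandamarri", "Pratapgarh"]

EARLIEST_DOB = date(1950, 12, 28)  # patients are born on or after this date


def generate_patient(patient_id, rng, latest_dob=date(2015, 12, 31)):
    """Return one simulated patient row (assumed layout, YYYY-MM-DD dates)."""
    span_days = (latest_dob - EARLIEST_DOB).days
    dob = EARLIEST_DOB + timedelta(days=rng.randint(0, span_days))
    return [
        patient_id,
        dob.isoformat(),
        rng.choice(GENDERS),
        rng.choice(MARITAL_STATUSES),
        rng.choice(SMOKING_STATUSES),
        rng.choice(CITIES),
    ]


def generate_patients_csv(n, seed=42):
    """Build CSV text with a header row followed by n simulated patients."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["patient_id", "DOB", "gender", "marital_status",
                     "smoking_status", "city"])
    for patient_id in range(1, n + 1):
        writer.writerow(generate_patient(patient_id, rng))
    return buffer.getvalue()


if __name__ == "__main__":
    print(generate_patients_csv(3))
```

Because no real patient is involved, a file produced this way can be shared and processed without the privacy constraints that apply to actual EMRs.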

The fields in the patient data set (patients.csv) are defined as follows:

  • patient_id: A unique number that identifies each patient
  • DOB: The patient’s date of birth. We have considered patients born on or after 28-12-1950
  • Gender: F (female) or M (male)
  • marital_status: Divorced, Single or Married
  • smoking_status: The patient’s smoking habit
  • city: The city where the patient lives
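
As a small illustration of this layout, one line of the file can be turned into a named record. This is a sketch, not code from the original analysis: the `Patient` tuple and `parse_patient` function are hypothetical names, the sample row is made up, and a plain comma split assumes that no field contains a quoted comma.

```python
from collections import namedtuple

# Column order follows the field list above.
Patient = namedtuple(
    "Patient",
    ["patient_id", "dob", "gender", "marital_status", "smoking_status", "city"],
)


def parse_patient(line):
    """Split one CSV line into a Patient, stripping stray whitespace."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(fields), line))
    return Patient(*fields)


if __name__ == "__main__":
    # Made-up sample row, not taken from the real data set.
    sample = "1001, 1984-03-17, F, Married, No, Sikar"
    patient = parse_patient(sample)
    print(patient.gender, patient.city)
```

Accessing fields by name (`patient.city`) rather than by position makes later transformations easier to read than index lookups such as `l[5]`.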

The fields in the diagnosis data set are defined as follows:

  • Diagnosis_id: A unique id for each diagnosis
  • Admission_id: A unique id for each admission of a patient to the hospital
  • Patient_id: A unique number that identifies each patient
  • diagnosis_ICD10_code: The standard code for the diagnosis. Because the coding is standardized across the healthcare industry and independent of the hospital, it can be used to identify a diagnosis across hospitals.


Patient encounters are recorded in the hospital database each time a patient visits the hospital. The following data set is thus generated:

The fields in this admissions data set are defined as follows:

  • Admission_id: Each time a patient comes to the hospital for a consultation, he/she is assigned a new number
  • Patient_id: A unique number that identifies each patient
  • Admission_date: The day when the patient is admitted to the hospital
  • discharge_date: The day when the patient is discharged
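
Since every visit gets a new admission id, one patient can own several admission records. The following sketch shows that one-to-many relationship by grouping admissions under their patient. The function names, the sample rows, and the YYYY-MM-DD date format are assumptions made for illustration; the actual file layout is not shown in this blog.

```python
from collections import defaultdict
from datetime import date


def parse_admission(line):
    """Parse 'admission_id,patient_id,admission_date,discharge_date'."""
    admission_id, patient_id, admitted, discharged = [
        field.strip() for field in line.split(",")
    ]
    return {
        "admission_id": admission_id,
        "patient_id": patient_id,
        # Assumption: dates are stored as YYYY-MM-DD.
        "admission_date": date.fromisoformat(admitted),
        "discharge_date": date.fromisoformat(discharged),
    }


def admissions_by_patient(lines):
    """Group admission ids under the patient they belong to."""
    grouped = defaultdict(list)
    for line in lines:
        record = parse_admission(line)
        grouped[record["patient_id"]].append(record["admission_id"])
    return dict(grouped)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        "A1, P1, 2016-01-05, 2016-01-09",
        "A2, P2, 2016-02-11, 2016-02-12",
        "A3, P1, 2016-03-20, 2016-03-22",
    ]
    print(admissions_by_patient(rows))
```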

Sample diagnosis description from the ICD-10 lookup data set: Cholera due to Vibrio cholerae 01, biovar cholerae


The fields in the ICD-10 lookup data set are defined as follows:

  • ICD_10_Code: The ICD-10 (International Classification of Diseases, 10th Revision) code assigned to each standard diagnosis
  • Diagnosis_description: Description of the diagnosis
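
The lookup data set is what turns a code stored with a diagnosis into readable text. A minimal sketch of that lookup follows. The function names and the sample diagnosis rows are hypothetical, and whether the file stores codes with a dot (as in `A00.0`, the ICD-10 code for the cholera entry) is an assumption.

```python
# Tiny stand-in for the ICD-10 lookup data set: code -> description.
ICD10_DESCRIPTIONS = {
    "A00.0": "Cholera due to Vibrio cholerae 01, biovar cholerae",
}


def describe_diagnosis(icd10_code, lookup=ICD10_DESCRIPTIONS):
    """Return the description for a code, or a marker if it is unknown."""
    return lookup.get(icd10_code.strip().upper(), "UNKNOWN CODE")


def label_diagnoses(diagnosis_rows, lookup=ICD10_DESCRIPTIONS):
    """Append a description to (diagnosis_id, admission_id, patient_id, code) rows."""
    return [row + (describe_diagnosis(row[3], lookup),) for row in diagnosis_rows]


if __name__ == "__main__":
    # Made-up diagnosis rows; "ZZZ" is deliberately not a valid code.
    rows = [("D1", "A1", "P1", "A00.0"), ("D2", "A2", "P2", "ZZZ")]
    for labelled in label_diagnoses(rows):
        print(labelled)
```

Because the code is independent of the hospital, the same lookup table works for diagnosis records from any source.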

The Scenario

We will consider a scenario in which we use a hypothetical EMR, similar to those that exist in actual healthcare institutions. Each patient’s data has a variety of parameters associated with it, for example basic demographic information (gender, location, etc.) and the patient’s identified diagnoses.

A typical data science project flow is shown below:

We’ll use Python, PySpark and MLlib to compute some basic statistics for our dashboard. This involves some typical steps that you can follow in Spark to get started with your own use case:

  • Reading data from the file system into a Spark RDD

  • Applying transformations to “massage” the data into a pair RDD

  • Computing summary statistics for each user and checking the distribution of the data
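
The three steps above can be sketched as follows. This is a minimal sketch rather than the blog's own code: the helper names `to_pair`, `count_by_key`, and `gender_counts_with_spark` are hypothetical, column index 2 for gender follows the patient field list, and the pure-Python functions stand in for the Spark transformations so that the logic can be checked without a cluster.

```python
from collections import defaultdict


def to_pair(line, column):
    """Step 2: turn a CSV line into a (key, 1) pair, as for a pair RDD."""
    return (line.split(",")[column].strip(), 1)


def count_by_key(pairs):
    """Step 3: local stand-in for reduceByKey(lambda a, b: a + b)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def gender_counts_with_spark(path):
    """The same three steps on Spark; needs a working PySpark installation."""
    # Imported lazily so the helpers above can run without Spark.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    lines = sc.textFile(path)  # step 1: read the file into an RDD
    pairs = (
        lines.filter(lambda l: "patient_id" not in l)  # drop the header row
        .map(lambda l: to_pair(l, 2))  # step 2: build the pair RDD
    )
    return pairs.reduceByKey(lambda a, b: a + b).collect()  # step 3: summarize


if __name__ == "__main__":
    # Made-up rows standing in for the contents of patients.csv.
    sample = [
        "patient_id,DOB,gender,marital_status,smoking_status,city",
        "1,1980-01-01,F,Single,No,Sikar",
        "2,1975-05-05,M,Married,Once,Adityapur",
        "3,1990-09-09,F,Divorced,No,Sikar",
    ]
    rows = [line for line in sample if "patient_id" not in line]
    print(count_by_key(to_pair(line, 2) for line in rows))
```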

Scenario I

Calculate each patient’s age and age group from the date of birth given in the EMR

  • Load the data from patients.csv

# sc is the SparkContext that the PySpark shell provides
patientfile = sc.textFile('file:///opt/spark_usecases/medical/datasets/patients.csv')

  • Check the number of records which are going to be processed

patientfile.count()  # number of lines in the file, including the header row

  • Calculate each patient’s age and age group, then cache the result in memory, since we will be repeating operations on this RDD.

# Create new features/attributes (age and age group) from the existing attributes

from datetime import date

def calculate_age(born):
    today = date.today()
    # Subtract one year if this year's birthday has not happened yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

def prepare_date(date_form):
    # Assumes the DOB is stored as year-month-day
    year, month, day = [int(x) for x in date_form.split("-")]
    try:
        born = date(year, month, day)
    except ValueError:  # raised for an invalid calendar date, e.g. February 29 in a non-leap year
        born = date(year, month, day - 1)
    return calculate_age(born)

def age_group(age):
    if age < 10:
        return '0-10'
    elif age < 20:
        return '10-20'
    elif age < 30:
        return '20-30'
    elif age < 40:
        return '30-40'
    elif age < 50:
        return '40-50'
    elif age < 60:
        return '50-60'
    elif age < 70:
        return '60-70'
    elif age < 80:
        return '70-80'
    else:
        return '80+'

def patient_attributes(line):
    # Keep the six original columns and append age and age group
    l = line.split(",")
    age = int(prepare_date(l[1]))
    return [l[0], l[1], l[2], l[3], l[4], l[5], age, age_group(age)]

# Drop the header row, add the derived columns, and cache the RDD for reuse
patient_demographics = patientfile.filter(lambda line: 'patient_id' not in line).map(patient_attributes).cache()

Scenario II

Find the distribution of data for each patient attribute

  • Find the distribution of male and female patients

patient_gender = patientfile.filter(lambda line: 'patient_id' not in line).map(lambda line: (line.split(',')[2].strip(), 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()

[(524, u'F'), (476, u'M')]

  • Find the distribution of marital_status

patient_married_status = patientfile.filter(lambda line: 'patient_id' not in line).map(lambda line: (line.split(',')[3].strip(), 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()

[(372, u'Divorced'), (321, u'Single'), (307, u'Married')]

  • Find the distribution across age groups

patient_age_group_wise = patient_demographics.map(lambda line: (line[7], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()

[(166, '10-20'), (162, '50-60'), (152, '40-50'), (151, '30-40'), (139, '0-10'), (138, '20-30'), (92, '60-70')]

  • Find the top 5 cities with the most patients, along with the number of patients in each

patient_city_wise = patient_demographics.map(lambda line: (line[5], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).take(5)

[(5, u'Talegaon Dabhade'), (5, u'Adityapur'), (5, u'Mandamarri'), (5, u'Sikar'), (4, u'Pratapgarh')]

  • Find the distribution of smoking_status (smoking habit)

patient_smoking_wise = patient_demographics.map(lambda line: (line[4], 1)).reduceByKey(lambda a, b: a + b).map(lambda line: (line[1], line[0])).sortByKey(False).collect()

[(256, u'Frequently'), (256, u'No'), (247, u'Once'), (241, u'Occasionally')]
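
Every query in Scenario II follows the same pattern: map each record to a (value, 1) pair, sum the ones per value, swap to (count, value), and sort by count in descending order. A minimal sketch of that pattern as a reusable helper is given here in plain Python so it can be run without a cluster. The names `distribution` and `top_n` are hypothetical and the sample records are made up. In PySpark itself, the existing methods `RDD.countByValue()` and `RDD.takeOrdered()` may shorten such queries, although that is a suggestion and not what this blog uses.

```python
from collections import Counter


def distribution(records, column):
    """Return (count, value) pairs sorted by count, largest first.

    Local equivalent of map -> reduceByKey -> swap -> sortByKey(False).
    """
    counts = Counter(record[column] for record in records)
    return sorted(((n, value) for value, n in counts.items()), reverse=True)


def top_n(records, column, n):
    """Return the n most frequent values as (count, value) pairs."""
    return distribution(records, column)[:n]


if __name__ == "__main__":
    # Made-up records shaped like the rows of patient_demographics:
    # [patient_id, DOB, gender, marital_status, smoking_status, city, age, age_group]
    records = [
        ["1", "1980-01-01", "F", "Single", "No", "Sikar", 37, "30-40"],
        ["2", "1975-05-05", "M", "Married", "Once", "Adityapur", 41, "40-50"],
        ["3", "1990-09-09", "F", "Divorced", "No", "Sikar", 26, "20-30"],
    ]
    print(distribution(records, 2))  # gender distribution
    print(top_n(records, 5, 1))      # most common city
```

One difference is worth noting: this local helper breaks ties by value, whereas Spark's `sortByKey` gives no guaranteed order among equal counts, so tied entries (such as the four cities with 5 patients each) may appear in a different order from run to run.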

We hope this blog gave you an overview of the benefits of Spark in the healthcare industry.

In the next blog, we will create a profile of each user with various diagnoses, procedures and other attributes that can be obtained from the data.


Satyam Kumar

With more than 5 years of experience, Satyam Kumar is a Subject Matter Expert in Big Data Solutions and has used his depth of experience to help bring new Big Data technologies to production. He has worked on several projects involving Hadoop, HDFS, MapReduce, Kafka, Flume, Hive and Spark.


