Breast Cancer Data Analysis using Spark

July 9

The purpose of this blog is to walk you through a sample use case of data analysis using Spark.
Apache Spark is a distributed computing framework that provides high-level APIs in Java, Python, Scala and R. A typical Spark program runs in parallel across many nodes in a cluster. It also supports a rich set of high-level tools such as:

  • Spark SQL for SQL and Structured data
  • MLlib for Machine Learning
  • GraphX for Graph processing
  • Spark Streaming

Now, let’s understand how data analysis can be performed using Spark.
Breast Cancer Analysis – Dataset
The clinical dataset used in this blog was released to raise awareness of breast cancer. For practice, a few problems have been designed along with their solutions to help you understand the analysis better.
Breast cancer is a disease in which the cells in the breast grow out of control. Two types of breast cancer are described here:

  • Invasive ductal carcinoma – The most common type of breast cancer is ductal carcinoma which begins in the cells of the ducts. Breast cancer can also begin in the cell lobules and in other tissues. Ductal carcinoma in situ is a condition in which abnormal cells are found in the lining of the duct but they haven’t spread outside the duct.
  • Invasive Lobular Carcinoma – Breast cancer that spreads from where it began in the ducts or lobules to surrounding tissues is called invasive breast cancer.

In inflammatory breast cancer, the breast looks red and swollen and feels warm because the cancer cells block the lymph vessels in the skin.
In the U.S., breast cancer is the second most common cancer in women after skin cancer. It can occur in both men and women, but it is rare in men. Each year there are about 100 times more new cases of breast cancer in women than in men.
Dataset link – Click Here to download Cancer Data Set
Dataset Description
The table below describes each column of the dataset:

Column No. | Column Name | Datatype | Description
Column 0 | Complete_TCGA_ID | string | Unique ID for each person.
Column 1 | Gender | string | Gender of the person (e.g. Female).
Column 2 | Age_Initial_diag | int | Age of the person at initial pathologic diagnosis.
Column 3 | ER_Status | string | Estrogen receptor status. Cancers whose cells have estrogen receptors are referred to as ER-positive (or ER+) cancers.
Column 4 | PR_Status | string | Progesterone receptor status. When breast cancer cells have progesterone receptors, the cancer is called PR-positive breast cancer.
Column 5 | HER2_Final_Status | string | HER2-positive breast cancer is a breast cancer that tests positive for a protein called human epidermal growth factor receptor 2 (HER2), which promotes the growth of cancer cells. In about 1 of every 5 breast cancers, the cancer cells have a gene mutation that makes an excess of the HER2 protein.
Column 6 | Tumor | string | Numbers after the T (such as T1, T2, T3, and T4) describe the tumor size and/or the amount of spread into nearby structures. The higher the T number, the larger the tumor and/or the more it has grown into nearby tissues.
Column 7 | Node | string | Numbers after the N (such as N1, N2, and N3) describe the size, location, and/or number of nearby lymph nodes affected by cancer. The higher the N number, the greater the spread to nearby lymph nodes. N0 means nearby lymph nodes do not contain cancer.
Column 8 | Node_Coded | string | Positive means cancer is present in nearby lymph nodes. Negative means it is not present.
Column 9 | Metastasis | string | M0 means the cancer has not spread to distant parts of the body. M1 means it has.
Column 10 | Metastasis_Coded | string | Positive means metastasis is present. Negative means it is not present.
Column 11 | AJCC_Stage | string | The AJCC staging system is a classification system developed by the American Joint Committee on Cancer for describing the extent of disease progression in cancer patients.
Column 12 | Converted_Stage | string | Any change in AJCC_Stage after treatment.
Column 13 | Survival_Data_Form | string |
Column 14 | Vital_Status | string | Whether the person is living or deceased.
Column 15 | Days_to_Date_of_Last_Contact | int | Number of days elapsed up to the date of last contact with the person.
Column 16 | Days_to_date_of_Death | int | Number of days elapsed up to the person's date of death.
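
The column layout above can be captured in a small Scala type, which makes the column indexes used in the problems below easier to follow. The following is a minimal sketch and not part of the original dataset tooling: the names `ClinicalRecord` and `ClinicalRecord.parse` are hypothetical, only four of the columns are modelled, and it assumes that no field contains an embedded comma.

```scala
import scala.util.Try

// Hypothetical record type covering only the columns used in the problems below.
final case class ClinicalRecord(
  id: String,          // column 0: Complete_TCGA_ID
  ageAtDiagnosis: Int, // column 2: Age_Initial_diag
  ajccStage: String,   // column 11: AJCC_Stage
  vitalStatus: String  // column 14: Vital_Status
)

object ClinicalRecord {
  // Parses one CSV line; returns None for rows with fewer than 17 columns
  // or with a non-numeric age. Assumes no field contains an embedded comma.
  def parse(line: String): Option[ClinicalRecord] = {
    // The -1 limit keeps trailing empty fields instead of dropping them.
    val cols = line.split(",", -1).map(_.trim)
    if (cols.length < 17) None
    else Try(cols(2).toInt).toOption.map { age =>
      ClinicalRecord(cols(0), age, cols(11), cols(14))
    }
  }
}
```

With a parser like this, a header line whose third field is the text Age_Initial_diag is rejected as `None` because the age is not numeric, so header and malformed rows can be filtered out in the same step.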

Now, let’s look at the problem statements for the breast cancer analysis and solve each of them using Spark.
Problem Statements
#1 – What is the average age at initial pathologic diagnosis?
The Spark code snippet below finds the average age at initial pathologic diagnosis:

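// Load the CSV file as an RDD of lines (the file path is blank in the original post)
val data = sc.textFile("")
// The first line is the header row
val header = data.first()
// Drop the header so that only data rows remain
val remove_header = data.filter(x => x != header)
// Column 2 is Age_Initial_diag; the sum is converted to Double so the division is not truncated
val avg_age = remove_header.map(x => x.split(",")).map(x => x(2).toInt).reduce(_ + _).toDouble / remove_header.count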
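
One detail worth checking in this calculation is the division: in Scala, dividing an `Int` sum by the `Long` returned by `count` is integer division and drops the fractional part. The sketch below illustrates the difference on a plain Scala collection rather than an RDD; the object name `AverageAge` and the sample ages are made up for illustration.

```scala
object AverageAge {
  // Integer division: the fractional part of the mean is discarded.
  // Assumes a non-empty input.
  def truncated(ages: Seq[Int]): Long = ages.sum / ages.size.toLong

  // Converting the sum to Double first keeps the fraction.
  // Returns None for an empty input to avoid dividing by zero.
  def exact(ages: Seq[Int]): Option[Double] =
    if (ages.isEmpty) None else Some(ages.sum.toDouble / ages.size)
}
```

For the ages 40, 51 and 63, the sum is 154, so `truncated` gives 51 while `exact` gives roughly 51.33.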

#2 – What is the average age of people in each AJCC stage?
The average age for each AJCC stage can be found with the code snippet below:

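val data = sc.textFile("file:///home/kiran/Documents/datasets/clinical_data_breast_cancer.csv")
val header = data.first()
val remove_header = data.filter(x => x != header)
// Pair each AJCC stage (column 11) with the age at diagnosis (column 2), then average per stage via (sum, count)
val stages = remove_header.map(x => x.split(",")).map(x => (x(11), x(2).toInt)).mapValues((_, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues { case (sum, count) => (1.0 * sum) / count }
// collect() brings the results to the driver before printing
stages.collect().foreach(println)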
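
The `(sum, count)` pair used above is what allows the average to be computed in a single aggregation: averages cannot be merged directly, but sums and counts can. Here is a minimal sketch of the same idea on a local Scala collection, with no Spark dependency; `StageAverages`, `averageByKey`, and the sample pairs in the usage note are hypothetical.

```scala
object StageAverages {
  // Mirrors mapValues((_, 1)) + reduceByKey + mapValues on a local collection:
  // each value becomes (value, 1), pairs with the same key are added
  // component-wise, and the final step divides the sum by the count.
  def averageByKey(pairs: Seq[(String, Int)]): Map[String, Double] =
    pairs
      .map { case (key, value) => (key, (value, 1)) }
      .groupBy(_._1)
      .map { case (key, group) =>
        val (sum, count) = group
          .map(_._2)
          .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
        (key, sum.toDouble / count)
      }
}
```

For example, the pairs ("Stage I", 40), ("Stage I", 50) and ("Stage II", 61) reduce to (90, 2) and (61, 1), giving averages of 45.0 and 61.0.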

#3 – How many people are there for each vital status?
The code snippet below counts the people in each vital status:

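val data = sc.textFile("")
val header = data.first()
val remove_header = data.filter(x => x != header)
// Emit (Vital_Status, 1) for each row (column 14) and sum the ones per status
val status = remove_header.map(x => x.split(",")).map(x => (x(14), 1)).reduceByKey(_ + _)
// collect() brings the results to the driver before printing
status.collect().foreach(println)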

We hope this blog article explaining a sample use case of breast cancer analysis using Spark helped you.
To become a successful Big Data Developer & build your Data Analysis skills, enroll in our Big Data Hadoop & Spark Training by Acadgild.