
Data Analyst vs Data Scientist?

With the rise of Big Data technology, new job titles such as Data Analyst and Data Scientist have emerged. Have you ever wondered how these roles differ? After all, both have the word "data" in common. The two roles are similar, yet they differ in some important ways. In this blog, I will bring out those subtleties.



Data Analyst

A Data Analyst is someone who analyzes large data sets, draws inferences from them, and presents the findings to management using reporting tools. A Data Analyst usually has a degree in Computer Science or an MBA, and additionally needs the following technical skills:

  • Basic knowledge of statistics.
  • Ability to use statistical programming languages like R, Stata, and SAS to manipulate data.
  • Knowledge of programming languages like Python or Ruby for web development, or familiarity with HTML and JavaScript for front-end development, to present data.
  • Knowledge of SQL querying.
  • Knowledge of Excel, which can be useful although it is an older tool.
  • Ability to use open source tools like Hadoop, Hive, Pig, Impala, and HBase to improve productivity in analysis tasks.

Put simply, Data Analysts are people who can convert the numbers in data into plain English sentences, which helps businesses to strategize. The challenge in presenting to management is that, even though the analysis is done with statistical methods and terms, the presentation should use business terminology, which implies that a Data Analyst should have good communication skills too. Although many areas are mentioned above, a Data Analyst need not attempt to master all of them; he or she can specialize in any one area. This leads us to the question:

How to become a Data Analyst?

There are 3 starting points:

  1. Starting with no knowledge of programming and math.
  2. Starting with a programming background.
  3. Starting with a strong mathematical background.

Here is a step-by-step guide to upskilling from each starting point.

Starting with no knowledge of Programming and Math

  1. Programming: This is a core skill for Data Analysts, and it is what differentiates a Data Analyst from a Business Analyst. You need to learn a programming language like Java, R, or Python and gain a good understanding of data science libraries like ggplot2, reshape2, and pandas.
  2. Statistics: To analyze data, you have to be familiar with descriptive and inferential statistics. Descriptive statistics help you summarize data and describe it in a meaningful way; inferential statistics help you infer properties of a larger data set from a sample, for example by identifying patterns that emerge from the sample. You may already know some of the basics of descriptive statistics from school, such as mean, median, mode, standard deviation, and variance. You then need to learn more advanced skills, such as comparing samples drawn from different types of distribution (standard normal, exponential, Poisson, binomial, chi-square) and applying tests for significance (Z-test, t-test, Mann-Whitney U, chi-squared, ANOVA). As a Data Analyst you’ll need to know how many samples to collect, how to account for different factors, how to choose good control and test groups, and so on.
  3. Math: A strong foundation in math is essential, as data is usually interpreted in numbers. You need to learn linear algebra, matrices, and calculus, and then be able to express real-life business problems in terms of numbers. For this you will need to be able to manipulate algebraic expressions and solve equations. Finally, you should be able to represent data as graphs of functions and highlight the relationships between graphs.
  4. Machine Learning: For a career as a Data Analyst, you won’t need to invent new machine learning algorithms (skills that advanced are needed to become a Data Scientist), but you should know the most common ones. A few examples include principal component analysis, neural networks, support vector machines, and k-means clustering. It is not mandatory to know the detailed theory and implementation of these algorithms, but you should understand their pros and cons, as well as when to (and when not to) apply them to a data set.
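As a rough illustration of the statistics described in step 2, the sketch below computes the basic descriptive statistics for two small samples and then a Welch t statistic to compare them. The sample values and the names `control`, `test`, `describe`, and `welch_t` are made up for this example, and only Python's standard library is used; in practice you would more likely reach for R or a statistics package.

```python
import math
import statistics

# Hypothetical samples: daily order counts for a control group and a test group.
control = [12, 15, 11, 14, 12, 16, 12, 12]
test = [16, 18, 15, 17, 19, 17, 17, 17]

def describe(sample):
    """Return common descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "mode": statistics.mode(sample),
        "variance": statistics.variance(sample),  # sample variance (divides by n - 1)
        "stdev": statistics.stdev(sample),
    }

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

control_stats = describe(control)
test_stats = describe(test)
t_statistic = welch_t(test, control)

print(control_stats)
print(test_stats)
print(f"Welch t statistic: {t_statistic:.2f}")
```

A large t statistic suggests that the difference between the two group means is unlikely to be due to chance alone; turning it into a p-value requires the t distribution, which a statistics package provides.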

There are three main types of machine learning:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

In supervised learning, a computer program is provided with two sets of data: a training set and a test set. The program learns from the labeled examples in the training set so that it can accurately identify the unlabeled examples in the test set. In effect, the program derives a rule from the training set and applies it to the test set. This is the type of program that sits in your phone and recognizes your voice.

Common algorithms used for this purpose include decision trees, Naive Bayes classification, and ordinary least squares regression.
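To make the training set and test set idea concrete, here is a minimal sketch of ordinary least squares regression with one input variable. The data points and the names `training_set`, `test_set`, `fit_ols`, and `predict` are invented for illustration; a real project would normally use a library rather than hand-written formulas.

```python
# Hypothetical labeled data: (advertising spend, resulting sales).
training_set = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8), (5.0, 11.1)]
test_set = [(6.0, 13.0), (7.0, 15.1)]

def fit_ols(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sums of cross-deviations and squared deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Learn the rule" from the training set only.
slope, intercept = fit_ols(training_set)

def predict(x):
    """Apply the learned rule to an unseen input."""
    return slope * x + intercept

# Check the rule against the held-out test set.
test_errors = [abs(predict(x) - y) for x, y in test_set]
mean_abs_error = sum(test_errors) / len(test_errors)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MAE={mean_abs_error:.3f}")
```

Keeping the test set out of the fitting step is the point of the split: the error on unseen data is what indicates whether the learned rule generalizes.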

In unsupervised learning, the algorithm draws inferences from data sets consisting of input data without labeled or known responses. The most common unsupervised learning method is cluster analysis, which is used in exploratory data analysis to find hidden patterns. Techniques of this kind are applied by Netflix to recommend movies and by Flipkart to predict products that you may like.

Common techniques used in unsupervised learning include clustering algorithms, Principal Component Analysis (PCA), and Singular Value Decomposition (SVD).
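As one example of a clustering algorithm, the following is a bare-bones sketch of k-means on a handful of two-dimensional points. The data, the function name `k_means`, and the choice of starting centroids are assumptions made for this example; it requires Python 3.8 or later for `math.dist`.

```python
import math

# Hypothetical unlabeled data: (visits per month, average basket value).
points = [(1.0, 1.2), (1.5, 0.8), (0.9, 1.1), (8.0, 8.3), (8.5, 7.9), (7.8, 8.1)]

def k_means(data, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in data:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Assumption: start from the first and last points; real code usually picks starts at random.
centroids, clusters = k_means(points, [points[0], points[-1]])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are supplied anywhere: the two groups emerge purely from the distances between points, which is what makes the method unsupervised.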

Lastly, reinforcement learning falls between the two methods above. Here, as the name implies, the program is not given the correct answers; it tries actions in a specific context and learns from the rewards or penalties that result. Some of the tools you’ll need are Q-learning, TD-learning, and genetic algorithms.
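A tiny sketch of Q-learning may help here. The environment below (a corridor of five positions with a reward at the far end), the constants, and the names `step` and `train` are all hypothetical, chosen only to keep the example short. The agent explores with random actions and still learns which action is best in each state.

```python
import random

N_STATES = 5            # states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor

def step(state, action):
    """Hypothetical environment: a reward of 1 only for reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Explore at random; Q-learning is off-policy, so it still learns greedy values.
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            # Q-learning update: move Q(s, a) toward reward + discounted best next value.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q_table = train()
# Greedy policy: in each non-terminal state, pick the action with the higher Q value.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The program is never told that moving right is correct; it infers this from the reward it eventually receives, which is the defining trait of reinforcement learning.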

  5. Data Wrangling, Visualization, and Intuition: To collect, organize, and analyze data, you need to equip yourself with knowledge of SQL querying, Hadoop, Spark, and MongoDB. After collecting and organizing data, you should know how to present it visually to stakeholders; tools like ggplot and matplotlib will help you do so. Apart from these, you should develop an intuition for which data sets to consider and which to leave out.
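The SQL querying part of this step can be tried without any server by using SQLite, which ships with Python. The table name `orders`, its columns, and the figures below are invented for this sketch; the text bar chart is only a stand-in for a proper plot made with a tool like matplotlib.

```python
import sqlite3

# An in-memory database with a small, hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 60.0), ("South", 40.0), ("East", 200.0)],
)

# Organize the raw rows: count orders and total the amounts per region.
rows = conn.execute(
    """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
    """
).fetchall()
conn.close()

# Present the summary as a rough text bar chart (one '#' per 20 units).
for region, n_orders, total in rows:
    print(f"{region:<6} {n_orders} order(s) {'#' * int(total // 20)} {total:.0f}")
```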


Starting with Programming background

If you are a software engineer or have studied programming languages in college, here is what you have to learn before applying for the role of a Data Analyst:

      1. Statistics: You should have the statistical skills mentioned above: be able to make statistical inferences, identify patterns, compare data sets, and apply the right techniques.
      2. Math: Linear algebra, matrices, calculus, and the ability to solve equations are the basic skills needed to manipulate data and represent it as graphs and reports.

Starting with strong Mathematical background

If you are a math whiz and aspire to be a Data Analyst, you need to acquire the following programming skills:

      1. Basic programming: Variables, loops, functions, control flow, etc.
      2. Object Oriented Programming: Learn to design your program so that it is based on Object Oriented patterns and is easy to develop, test, and maintain.
      3. Data Structures: Learn Arrays, Stacks, Queues, Lists, and Graphs.
      4. Software Design Patterns: Many robust software design patterns are available; learn the common ones and when to apply them.
      5. Algorithms: Learn which algorithms apply to which kinds of problems. This knowledge makes a huge difference to how long your data analysis takes to produce useful results.
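To see why the choice of data structure and algorithm matters, consider the sketch below. It looks up a value in a sorted list in two ways, and then shows a stack and a queue. The data and the function names `linear_contains` and `binary_contains` are made up for illustration.

```python
from bisect import bisect_left
from collections import deque

# Hypothetical sorted customer IDs (even numbers only).
sorted_ids = list(range(0, 200_000, 2))

def linear_contains(values, target):
    """O(n): inspect the items one by one."""
    for value in values:
        if value == target:
            return True
    return False

def binary_contains(values, target):
    """O(log n): repeatedly halve the search range of a sorted list."""
    index = bisect_left(values, target)
    return index < len(values) and values[index] == target

# Both give the same answer, but the binary search needs about 17 comparisons
# here, while the linear scan may need up to 100,000.
found_linear = linear_contains(sorted_ids, 199_998)
found_binary = binary_contains(sorted_ids, 199_998)

# A stack is last in, first out; a Python list works well for this.
stack = []
stack.append("load")
stack.append("clean")
last_task = stack.pop()

# A queue is first in, first out; deque removes from the left efficiently.
queue = deque(["load", "clean"])
first_task = queue.popleft()

print(found_linear, found_binary, last_task, first_task)
```

On a list this small the difference is barely noticeable, but the gap widens quickly as the data grows, which is where the "how long your analysis takes" point comes from.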

 

Data Scientist

A Data Scientist is a statistician and a software engineer rolled into one.

What does a Data Scientist do?

      • First and foremost, when a business problem like customer retention or cost reduction is presented to a Data Scientist, he or she helps to solve it using data-intensive methods. In the process of solving such problems, insights are usually discovered and inferred from the data sets.
      • Parallelize and iterate as fast as possible on the problem to be solved.
      • Build data products like dashboards, machine learning models, and tools that others can use to analyze data.

Data Scientists choose the tools based on the field and context in which they work. The specific skills that Data Scientists have are:

      1. Expertise in math and statistics, to select the right algorithm to apply and to derive models.
      2. Ability to apply machine learning algorithms to make predictions.
      3. Knowledge of R or Python, to do analysis and build models.
      4. Sharp business acumen.

In short, a Data Scientist should be an expert in math, statistics, technology, and business. In reality, though, it is rarely possible for one person to be an expert in all of these areas. That is why there are Data Science teams, in which each member has expertise in one area but can communicate with members whose expertise lies in another.

The combination of expertise in these areas is what places a Data Scientist above a Data Analyst. But it also means that a Data Analyst can grow into a successful Data Scientist.

How to become a Data Scientist is the next obvious question. Apart from earning a degree in statistics or math, the basic step is to get trained in:

      • Hadoop/Big Data programming.
      • Hive, Pig, and Impala.
      • Data Science & Business Applications of Data Science.
      • Fundamentals of Machine Learning.
      • Apache Mahout.

Conclusion

There is no doubt that data analysis is a mushrooming field. If you are about to embark on a career in data analysis, the skills listed above are the building blocks. Learning them does require an investment, but the payoffs are promising indeed!

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.


