15 Data Science FAQs – Top Interview Questions

Cracking the Job Interview

It is easy to see why data science is popular among professionals looking to advance rapidly in their careers. The rise of information technologies has brought about a data revolution. Not only is more data being created every day, but our ability to gather and learn from this data is also increasing. The trouble for organizations and businesses is no longer figuring out how to use data science, but figuring out who has the skills to use it, especially because there is an acute shortage of data scientists in the job market. It is estimated that India alone has a shortage of 2 lakh (200,000) data professionals. Opportunities are plentiful for anyone who has data skills and can demonstrate them. This blog answers frequently asked data science questions to help you crack the job interview.

Data Science FAQs

1) How are machine learning and data science related?

Machine learning refers to the use of algorithms that help computers identify trends and patterns in data. In a way, machine learning enables data science, but it is not the whole of it. Data science includes all the ways in which data is collected, organized, analyzed and interpreted for organizational and practical purposes.

2) What is A/B testing? Give an example.

A/B testing is a controlled experiment that compares two versions, A and B, of a single variable to see which one produces a better outcome. For instance, it can be used to test the effectiveness of two distinct banner ads by randomly showing each ad to a share of the audience and comparing the outcome, such as the click rate on each ad.
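One common way to judge whether the difference in click rates between two ads is more than chance is a two-proportion z-test. The sketch below is a minimal illustration using only the Python standard library; the function name and the click and view counts are hypothetical, and the test assumes samples large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled click rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical numbers: each banner ad was shown 2,400 times.
z, p = two_proportion_z_test(clicks_a=120, views_a=2400, clicks_b=156, views_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (conventionally below 0.05) suggests that the difference between the two ads is unlikely to be due to chance alone.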

3) What are the different types of statistical analyses that can be conducted based on the number of variables?

Descriptive statistical analyses can be broadly divided into three types according to the number of variables involved. Univariate analyses focus on a single variable, such as the sales of Android phones in Mumbai. Bivariate analyses take two variables into consideration, such as the sales of Android handsets across different Indian cities. In this case, the sales figure is one variable and the city is the second. Lastly, multivariate analyses study more than two variables, such as the effect of two banner ads, A and B, on the sale of Android phones in different Indian cities.

4) What are confounding variables?

Confounding variables have a direct or indirect effect on both the dependent and the independent variable. Because they are often not measured, they are difficult to spot in analysis. For instance, to study the mileage of different cars at a certain speed, we can observe the effect an independent variable, speed, has on the dependent variable, mileage. But the mileage could also depend on other variables, such as the weight of the car or the quality of the fuel that runs it. If these variables also influence the speed at which a car is driven, they affect both speed and mileage and are therefore confounding variables. Unless the scientist gathers data on them and controls for them, their effect cannot be separated from that of speed.

5) What is logistic regression?

Logistic regression is like linear regression in that it models the relationship between one or more explanatory variables and an outcome. The difference between the two is that linear regression predicts a continuous value, whereas logistic regression predicts the probability of a categorical outcome, most commonly a binary one such as yes or no.
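The core idea of logistic regression can be sketched in a few lines: a linear score is passed through the sigmoid function, which squeezes it into a probability between 0 and 1. The weight, bias and the "hours studied" feature below are hypothetical values chosen for illustration; in practice they would be learned from data.

```python
from math import exp


def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    return 1 / (1 + exp(-x))


def predict_probability(hours, weight=1.5, bias=-6.0):
    """Probability of a positive outcome (for example, passing an exam).

    weight and bias are made-up coefficients, not fitted values.
    """
    return sigmoid(weight * hours + bias)


def predict_class(hours, threshold=0.5):
    """Turn the probability into a yes (1) or no (0) prediction."""
    return 1 if predict_probability(hours) >= threshold else 0


for hours in (2, 4, 6):
    print(hours, round(predict_probability(hours), 3), predict_class(hours))
```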

6) What are outliers and how are they treated in analysis?

Outliers are values in a data set that lie at a great distance from the other values. For example, if the data set is made of five values, 1, 2, 3, 4 and 30, then 30 is an outlier. Outliers are problematic in analysis because they have a significant effect on the mean and can thereby misrepresent the whole data set. In the above example, the mean when 30 is excluded is (1 + 2 + 3 + 4) / 4 = 10 / 4 = 2.5. With the outlier, however, the mean comes out to be (10 + 30) / 5 = 8, which is significantly higher than 2.5. There are two common ways of treating an outlier: either transform or cap it to bring it closer to the other values, or eliminate it altogether to prevent it from misrepresenting the data set.
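The arithmetic in this example can be checked with Python's built-in statistics module. The short sketch below also prints the median, which is one reason analysts often report it alongside the mean: it barely moves when an outlier is present.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 30]
without_outlier = [v for v in values if v != 30]

print(mean(without_outlier))  # 2.5
print(mean(values))           # 8
print(median(values))         # 3
```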

7) Explain feature vectors.

A feature vector is an ordered list of numerical values, one per measurable attribute (feature), that describes an object. In machine learning, feature vectors serve as the explanatory variables that statistical procedures such as linear regression use to predict outcomes. For example, any color can be represented by a feature vector of three values: its red, green and blue (RGB) components. These values allow us to describe precisely how one color differs from another.
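To make the RGB example concrete, the sketch below treats each color as a three-value vector and measures how far apart two colors are using Euclidean distance. The color names and values are standard RGB triples; using distance as a measure of similarity is an assumption made here for illustration.

```python
from math import dist

# Each color is described by three values: (red, green, blue).
red = (255, 0, 0)
orange = (255, 165, 0)
blue = (0, 0, 255)

# A smaller distance means the two colors are more alike.
print(dist(red, orange))  # 165.0
print(dist(red, blue))
```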

8) Explain root-cause analysis.

Root cause analysis is a problem-solving technique that tries to get to the heart of errors in processes. It is a method of prevention rather than of cure: instead of treating symptoms, it identifies the underlying cause of a fault so that the fault can be avoided in the future.

9) Explain cross-validation.

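Cross-validation is a technique to check how well a predictive model performs on data it has not seen. The data is split into several parts, called folds. The model is trained on some folds and tested on the remaining one, and the process is repeated until every fold has been used for testing once. Averaging the results gives a more reliable estimate of the model's effectiveness than a single split of the data.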
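A common form of cross-validation is k-fold cross-validation. The sketch below shows only the splitting step, under the assumption that the data set has already been shuffled; the function name is hypothetical, and a real project would typically rely on a library routine instead.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_splits(n_samples=10, k=3):
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so the model is always evaluated on data it was not trained on.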

10) What is K-means clustering?

K-means is a clustering algorithm. It groups similar data points into K distinct clusters, where each point belongs to the cluster whose center (mean) is nearest to it. It is useful for programs like search engines, which can return numerous results for any search term. For example, a search for “uber” could potentially display results for the taxi service company, the food that the same company delivers, or dictionaries that define the word. By clustering similar pages together, a search engine can display all the pages on Uber cabs once it figures out that you are looking for information about the taxi service.
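The two steps that K-means repeats, assigning each point to its nearest center and then moving each center to the mean of its points, can be sketched for one-dimensional data as follows. The data points, starting centers and fixed iteration count are assumptions made for illustration; real implementations work in many dimensions and stop when the centers no longer move.

```python
def k_means_1d(points, centroids, iterations=10):
    """Return (centroids, clusters) after a fixed number of iterations."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = k_means_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```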

11) Explain power analysis.

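Power analysis is carried out to test the strength of an experimental design. It estimates the sample size needed to detect an effect of a given size or, conversely, the probability that an experiment with a given sample size will detect an effect that really exists. This helps verify whether the results of an experiment on a sample can be reliably generalized to larger data sets.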
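As a rough illustration, the sample size needed to compare the means of two equally sized groups can be approximated from the desired significance level, the desired power and the expected effect size (the difference in means divided by the standard deviation). The sketch below uses the normal approximation, so exact methods may give a slightly larger number; the function name and default values are assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample test.

    Uses the normal approximation: n = 2 * (z_alpha + z_power)^2 / effect_size^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Smaller effects need larger samples to be detected reliably.
for d in (0.8, 0.5, 0.2):
    print(d, sample_size_per_group(d))
```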

12) What is selection bias?

Selection bias occurs when the sample in an experiment is not properly randomized, whether through the choices of the researcher or research organization or through the way the data is collected. The sample then misrepresents the population under scrutiny and produces distorted, inaccurate findings. For example, if a researcher wants to show that rainfall over a year is higher than it really is, he may record more observations during the rainy season than at other times, which inflates the annual average.
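The rainfall example can be illustrated with a few lines of Python. The monthly figures below are invented purely for illustration and do not describe any real location; the point is only that averaging over a hand-picked subset gives a very different answer from averaging over the whole year.

```python
from statistics import mean

# Hypothetical monthly rainfall in millimetres (illustrative values only).
monthly_rainfall = {
    "Jan": 10, "Feb": 10, "Mar": 20, "Apr": 20, "May": 40, "Jun": 200,
    "Jul": 300, "Aug": 260, "Sep": 180, "Oct": 80, "Nov": 30, "Dec": 10,
}
rainy_season = ["Jun", "Jul", "Aug", "Sep"]

# Average over every month: a fair picture of the year.
unbiased_average = mean(list(monthly_rainfall.values()))
# Average over rainy months only: the sample was selected, not randomized.
biased_average = mean([monthly_rainfall[m] for m in rainy_season])

print(round(unbiased_average, 1))
print(biased_average)
```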

13) Give a few examples of machine learning from the real world.

Machine learning is used for a variety of purposes, from making recommendations on websites to maintaining customer records and assessing opportunities or problems. At Netflix, it is used to make movie recommendations. At Amazon, it is used to predict and recommend the products that customers need.

14) What are false negatives and false positives?

False negatives and false positives are errors in predicting outcomes. A false positive wrongly identifies a trait that is not present. For instance, in fraud detection, a legitimate transaction may be wrongly flagged as fraud. A false negative fails to recognize a trait that is present, as when a fraudulent transaction is wrongly passed as legitimate. Similarly, when a bank gives out credit, wrongly predicting that a borrower can pay back is a false positive for creditworthiness, and it creates bad debts.
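The four possible outcomes of a yes/no prediction can be counted by comparing predictions with what actually happened. In the sketch below, 1 stands for fraud and 0 for a legitimate transaction; the labels and the function name are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1  # fraud correctly flagged
        elif p == 1 and a == 0:
            counts["FP"] += 1  # legitimate transaction wrongly flagged as fraud
        elif p == 0 and a == 0:
            counts["TN"] += 1  # legitimate transaction correctly passed
        else:
            counts["FN"] += 1  # fraud that was missed
    return counts


actual = [1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 0, 1]
print(confusion_counts(actual, predicted))
# {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```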

15) What are decision trees?

Decision trees are a type of algorithm used to classify information and predict outcomes by asking a sequence of questions, with each answer leading down a different branch. For example, the answer to the question “Are you a data scientist?” could be either yes or no. If the answer is yes, the next branch could ask which tasks the data scientist engages in, to find out which tasks are most popular. If the answer is no, the next branch could ask about other occupations to determine what the individual does for a living.
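A decision tree can be pictured as nested yes/no questions. The sketch below builds a tiny tree by hand and walks a record down it; the questions and outcome labels are hypothetical, and in practice the tree would be learned from data rather than written manually.

```python
# Each internal node asks one yes/no question; each leaf is a final label.
tree = {
    "question": "is_data_scientist",
    "yes": {
        "question": "builds_models",
        "yes": "modelling-focused data scientist",
        "no": "analysis-focused data scientist",
    },
    "no": "other occupation",
}


def classify(node, record):
    """Follow the branches until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        branch = "yes" if record[node["question"]] else "no"
        node = node[branch]
    return node


print(classify(tree, {"is_data_scientist": True, "builds_models": False}))
print(classify(tree, {"is_data_scientist": False}))
```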

Bossing the Interview

The questions in this blog are only some of the data science questions frequently asked at job interviews. They should help you start thinking about the kind of questions that employers can ask and prepare for them. The list is by no means comprehensive, however, since it covers only 15 questions. Feel free to add to this list in the comments section. It is important to remember that interviews are as much about demonstrating abilities and skills as they are about showing theoretical understanding. Make sure you include examples and first-hand experiences in your answers to establish the practical value of hiring you, and leave the rest, as they say, to fate. Here’s wishing you all the best for the interview.
To learn more about big data and data science, subscribe to this blog or visit Acadgild.

Rohan Kumar

First-gen Rohantosh. Admirer and critic of all things tech.
