There are two ways of improving at any skill – learning what to do and learning what not to do. If you want to get really good at a skill, you’ll probably want to do both. If you’re interested in becoming a good data scientist, this blog can help you achieve at least one of the aforementioned objectives. Here are fifteen data science mistakes that you can easily avoid.
15 Common Data Science Mistakes
Cherry Picking
Quite simply, cherry-picking is being dishonest with data. We all know what it means – choosing the data you like and ignoring the rest. It is very important to be objective and unbiased towards your data if you want to achieve reliable results. Safe to say, cherry-picking is not a good idea. It can misrepresent the facts of the problem you are working on and cause you to devise a faulty solution. Believe me when I say – honesty is the best policy in matters of data.
Data Dredging
It is common for people – especially those with only a vague understanding of how data science works – to assume that data analysis means picking out obvious correlations from a variety of data. This is, however, not entirely true. Data analysis requires logical reasoning that explains why correlations exist in the data. Without a proper explanation, there remains the possibility that a correlation arose purely by chance. In such cases, the correlations do not reveal actionable insights – only chance findings that cannot be tested or established in subsequent research. Data dredging is the practice of reporting chance correlations that fall outside the purview of the initial hypothesis without offering any insight into the reasons for the correlation.
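To see how easily dredging manufactures "findings", here is a small illustrative simulation in plain Python (the numbers are my own, chosen only for demonstration): one random outcome is correlated against 200 equally random predictors, and several of them look notably correlated by pure chance.

```python
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation, plain Python."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# One random "outcome" and 200 random, unrelated "predictors",
# 30 observations each.
outcome = [random.gauss(0, 1) for _ in range(30)]
predictors = [[random.gauss(0, 1) for _ in range(30)] for _ in range(200)]

# Dredge: keep every predictor whose sample correlation looks notable.
spurious = [p for p in predictors if abs(corr(outcome, p)) > 0.35]
print(len(spurious))  # several "findings", all of them pure chance
```

None of these predictors has anything to do with the outcome – which is exactly why a correlation alone, without a reason behind it, proves nothing.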
False Causality
An offshoot of the data dredging fallacy, false causality is the wrong assumption that correlation implies causation, and it can derail research. The lesson from this mistake and the previous one is that not all correlations are revealing. A data scientist must always dig deeper than what is apparent on the surface and go beyond simple correlations to gather evidence of the underlying cause.
Cobra Effect
The mistake that comes with an anecdote – who can't learn from a story, right? When the British wanted to get rid of cobras in India, they offered the local population a reward for cobra skins. The result? Locals started farming cobras to collect the incentive, thereby increasing the number of snakes in the region. The cobra effect is the counter-productive use of incentives in research.
Survivorship Bias
Drawing conclusions from incomplete data. Presumptions play a huge role in data analysis – more specifically, in making it inaccurate. The reason? Presumptions blind data scientists to aspects of the data that are relevant, even if not obviously so at first glance. Hence, it is important in research to ask where your answers might lie beyond the data that survives your initial selection.
Gerrymandering
A practice that is extremely detrimental to election results, gerrymandering is the manipulation of the geographic boundaries used to group data with the intention of skewing results in a particular direction. The principle extends to any practice that manipulates how data is segmented in order to achieve a result the data scientist desires.
Sampling Bias
A mistake that, as the name suggests, is inherent to the sample a data scientist studies. Sampling bias is drawing conclusions from a data set that does not accurately represent the population under study. Needless to say, without adequate representation of all sections of the population, it is difficult to derive correct results or insights from the data.
Gambler's Fallacy
The idea that how frequently an independent event has occurred so far determines the probability of it occurring in the future. At a roulette table, it is wrong for a player to assume that because red has come up several times in a row, black is now "due" – or, equally, that red will keep its streak going. Similarly, in research, the data scientist must go beyond the frequency of events and try to understand the mechanisms behind the numbers.
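A quick way to convince yourself is to simulate independent events – a sketch in Python, with illustrative numbers of my own choosing: after three heads in a row, the next fair-coin flip is still heads about half the time.

```python
import random

random.seed(42)

# 200,000 fair-coin flips; True means heads.
flips = [random.random() < 0.5 for _ in range(200_000)]

# Collect the flip that follows every run of three heads.
after_streak = [flips[i + 3] for i in range(len(flips) - 3)
                if flips[i] and flips[i + 1] and flips[i + 2]]

rate = sum(after_streak) / len(after_streak)
print(round(rate, 2))  # ~0.5: the streak tells you nothing about the next flip
```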
Hawthorne Effect
Also known as the observer effect, this is the effect that researchers themselves have on the human subjects in their sample. Why is it called the Hawthorne effect? Because it was discovered during studies of the effect of the working environment on workers at the Hawthorne Works factory in the 1920s. The researchers found that being observed had a greater impact on the workers' productivity than any factor in the work environment itself.
Regression to the Mean
There is a tendency for many kinds of data to naturally regress to the average, or mean. For instance, Mumbai might experience heavy rainfall for two or three days in a row, after which the rain subsides to the city's average of "normal" rainy days. If, when this happens, a data scientist attributes the return to the mean to, say, a bureaucrat's actions during the heavy-rainfall days, that is a fallacy – the rainfall was bound to regress naturally. Data scientists must always be mindful of extremes – whether good or bad – during analysis before jumping to conclusions about their causes.
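The rainfall example can be sketched in a few lines of Python (the city average and spread here are hypothetical numbers of my own): when daily rainfall is just noise around a long-run average, the day after an extreme day is, on average, already back near normal – no bureaucrat required.

```python
import random

random.seed(7)

CITY_AVERAGE = 10.0  # hypothetical average daily rainfall, in mm

# Daily rainfall: independent draws around the long-run city average.
rain = [random.gauss(CITY_AVERAGE, 3.0) for _ in range(100_000)]

# Average rainfall on the day AFTER an unusually wet day
# (more than 2 standard deviations above the mean).
next_day = [rain[i + 1] for i in range(len(rain) - 1)
            if rain[i] > CITY_AVERAGE + 6.0]
avg_next = sum(next_day) / len(next_day)
print(round(avg_next, 1))  # back near the average, with no intervention at all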
Simpson's Paradox
From data science mistakes like sampling bias and gerrymandering, it is clear that how you group or select data has a crucial impact on the results you achieve. Simpson's paradox is in line with this idea: a trend that holds in separate groups of data can disappear or even reverse when the groups are combined. For instance, women may have better acceptance rates than men in every individual course at a university, yet a worse acceptance rate across the university as a whole, because the two groups apply in different numbers to courses of different difficulty. Averages over combined groups can be very misleading.
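The university example is easy to verify with concrete numbers. This sketch uses hypothetical admission figures of my own, chosen only to reproduce the paradox:

```python
# Hypothetical admission numbers, as (applicants, admits) per group and course.
admissions = {
    "easy_course": {"women": (20, 18), "men": (100, 80)},
    "hard_course": {"women": (100, 30), "men": (20, 4)},
}

def rate(applied, admitted):
    return admitted / applied

# Within EACH course, women have the higher acceptance rate
# (0.90 vs 0.80, and 0.30 vs 0.20).
for course, groups in admissions.items():
    assert rate(*groups["women"]) > rate(*groups["men"])

# Combine the courses and the ranking flips.
totals = {
    group: [sum(values) for values in
            zip(*(admissions[c][group] for c in admissions))]
    for group in ("women", "men")
}
women_overall = rate(*totals["women"])  # 48 / 120 = 0.40
men_overall = rate(*totals["men"])      # 84 / 120 = 0.70
print(women_overall < men_overall)      # True: women behind overall
```

The flip happens because most women applied to the hard course and most men to the easy one – the group sizes, not the per-course rates, drive the combined average.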
McNamara Fallacy
Data scientists can develop a bad habit of obsessing over metrics, which sometimes causes them to lose sight of the larger picture. When they do, they are committing the McNamara fallacy. The mistake is named after a US Secretary of Defence – Robert McNamara – whose uncompromising belief that enemy body counts measured the success of the war caused him to ignore other factors, such as public opinion, that eventually had a decisive bearing on the outcome of the Vietnam War.
Overfitting
Data science uses mathematical and statistical models to find patterns in data. Generally, complex models fit the training data better. Having said that, they tend to be brittle. Simple models tend to be more robust and better at making predictions on data they have never seen. When data scientists use overly complex models, they run the risk of overfitting them to the data at hand.
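Here is a minimal sketch of that trade-off in plain Python, with made-up data of my own: a two-parameter line versus a degree-9 polynomial that passes through every training point exactly. The polynomial has zero training error, yet on fresh points – including a couple just beyond the training range – the simple line predicts better.

```python
import random

random.seed(1)

# Training data: a simple linear trend (y = 2x) plus noise.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + random.gauss(0, 1) for x in train_x]

# Simple model: ordinary least-squares line (two parameters).
n = len(train_x)
mx, my = sum(train_x) / n, sum(train_y) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
         / sum((x - mx) ** 2 for x in train_x))
intercept = my - slope * mx

def line(x):
    return slope * x + intercept

# Complex model: the degree-9 Lagrange polynomial that passes through
# every training point exactly (zero training error).
def poly(x):
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        term = yi
        for j, xj in enumerate(train_x):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Fresh data from the same process, at points the models never saw
# (the last two lie beyond the training range).
test_x = [i + 0.5 for i in range(11)]
test_y = [2 * x + random.gauss(0, 1) for x in test_x]

def mse(model):
    return sum((model(x) - y) ** 2
               for x, y in zip(test_x, test_y)) / len(test_x)

print(mse(line), mse(poly))  # the simple line generalizes better
```

The polynomial has memorized the noise in the training set, so away from the training points – and especially beyond them – its predictions swing wildly.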
Publication Bias
Every researcher, scientist or writer would like to get published, and what gets published is what is interesting – what stands out. Data scientists might conduct similar experiments for different organizations or research projects, yet not all of them achieve the same level of success. Given the prestige up for grabs, data scientists may be tempted to favour the results that have the potential to get published. This, however, is a cardinal sin that all data scientists must avoid.
Relying Only on Summary Metrics
In the 1970s, a statistician named Francis Anscombe became famous for demonstrating that the summary statistics of data sets can be nearly identical even when the values in each set are very different. He constructed four data sets, known as Anscombe's quartet, that share the same mean and variance yet look completely different when graphed – demonstrating why relying on summary metrics alone during data analysis is insufficient.
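You can check this yourself – the quartet's values are public, and a few lines of Python reproduce the matching summaries:

```python
from statistics import mean, variance

# Anscombe's quartet: four (x, y) data sets with near-identical summaries.
anscombe = {
    "I":   ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in anscombe.items():
    print(name, round(mean(x), 2), round(variance(x), 2),
          round(mean(y), 2), round(variance(y), 2))
# Each set has x mean 9.0, x variance 11.0, y mean ~7.5, y variance ~4.12 –
# yet plotted, they look nothing alike: a cloud, a curve, a line with one
# outlier, and a vertical strip.
```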
Doing Data Science Right
They say a penny saved is a penny earned. By the same logic, a mistake avoided is a step in the right direction. I hope this blog helps you avoid these 15 data science mistakes and points you in the right direction. Wishing you all the best, and happy learning.