Data mining is an automated process of pattern discovery in large data sets. It relies on mathematical and statistical algorithms to not just categorize data into different types, but also to judge the likelihood of an event occurring in the future. Simply put, data mining is the process of gaining intelligence from data that can be used to inform decisions.
Features of Data Mining
Data mining is an advanced form of analytics. It differs from simple data analysis in the following ways:
It is automatic: data mining is carried out by predictive models rather than by manual inspection. These models are generally built from sample data sets and then generalized to new input data. Applying such a model to new data is known as scoring. When a data mining model is not trained on a sample set, it can serve other purposes, such as grouping similar data points to identify the characteristics of clusters in the data.
It works on voluminous data: data mining is geared toward the analysis of big data. Because it is automatic, it can comfortably sift through large amounts of data to recognize trends and patterns that would otherwise stay hidden in simple data analysis.
It makes predictions: data mining is inherently a process of making predictions. A prediction may foresee an event or a correlation between two or more factors influencing a data set. The likelihood that a prediction holds is generally quantified and labelled as the confidence. Correlations are generally expressed as rules that find substantial support in the data.
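The ideas of support and confidence can be made concrete with a small sketch. It assumes the common association-rule definitions: support is the share of records that contain all items in a rule, and confidence is the share of records containing the "if" part that also contain the "then" part. The transactions and the function names support and confidence are invented for illustration.

```python
# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the whole itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also
    contain consequent."""
    with_antecedent = count_containing(antecedent, transactions)
    if with_antecedent == 0:
        raise ValueError("antecedent never occurs in the data")
    with_both = count_containing(set(antecedent) | set(consequent), transactions)
    return with_both / with_antecedent


# Rule under test: "if bread, then milk".
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data the rule has a support of 0.60 (three of five baskets) and a confidence of 0.75 (three of the four baskets containing bread). A real tool would typically discard rules that fall below chosen thresholds for both values.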
How It’s Different from Statistics
Data mining is like statistics in the sense that it uses data to draw inferences. However, statistics uses smaller sample sets to generalize about a larger group or population, whereas data mining uses large data sets to create predictive models that can be more accurate. Statistical techniques often require interaction with members of the sample set, and they use computers for information processing rather than automation. Data mining generally does not involve interacting with the sample set, and it is an automatic process.
How It’s Different from Data Warehousing
Data warehousing precedes data mining. It is the process of cleaning and preparing data for analysis. Most importantly, data warehousing is the process of storing data in a form that is suitable for data analysis. Data mining, on the other hand, is the process of building models that can perform the analysis on the data.
Relation to OLAP
OLAP stands for Online Analytical Processing. It is an approach used to quickly analyze different dimensions of a data set. OLAP supports data mining because it is useful for summarizing data, what-if analysis, time series analysis, and so on. However, it lacks the ability to draw inductive inferences, that is, to draw general conclusions from specific examples. Organizations use OLAP when they need a multi-dimensional view of data, and it is especially useful for working with hierarchies. Data mining has no regard for either dimensions or hierarchies.
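As a rough illustration of what analyzing different dimensions means, the sketch below rolls a small fact table up along chosen dimensions. It is plain Python rather than a real OLAP engine, and the rows, the dimension names, and the roll_up helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, course, enrolments). Invented figures.
DIMENSIONS = ("year", "region", "course")
facts = [
    (2022, "North", "Statistics", 40),
    (2022, "North", "Databases", 25),
    (2022, "South", "Statistics", 30),
    (2023, "North", "Statistics", 45),
    (2023, "South", "Databases", 20),
]


def roll_up(facts, keep):
    """Sum the enrolment measure, keeping only the dimensions named in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


# Coarse view: one total per year.
by_year = roll_up(facts, ["year"])
# Finer view: drill down from year to year and region.
by_year_region = roll_up(facts, ["year", "region"])

print(by_year)
print(by_year_region)
```

The same data is summarized at two levels of a simple hierarchy (year, then year and region). Nothing is predicted here, which matches the point above: the summaries describe the data but do not generalize beyond it.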
For these reasons, OLAP and data mining are complementary ways of analyzing data that can be integrated for better results. For example, data mining can be used to identify who is most likely to drop out of or succeed in a course, and OLAP can then be used to understand what separates these groups.
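One way to picture this combination is sketched below. A very simple model (a nearest-centroid classifier, chosen here only for brevity) is built from a labelled sample and used to score new students, and the predictions are then counted per study mode in an OLAP-like summary. The student records, the feature choice, and the function names are assumptions made for illustration, not a recommended method.

```python
from collections import Counter, defaultdict
from math import dist

# Hypothetical labelled sample: (attendance rate, grade as a fraction) -> outcome.
training = [
    ((0.95, 0.82), "succeed"),
    ((0.90, 0.75), "succeed"),
    ((0.85, 0.88), "succeed"),
    ((0.40, 0.51), "drop_out"),
    ((0.55, 0.48), "drop_out"),
    ((0.35, 0.60), "drop_out"),
]


def fit_centroids(samples):
    """Build the model: the mean feature vector of each outcome."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (attendance, grade), label in samples:
        entry = sums[label]
        entry[0] += attendance
        entry[1] += grade
        entry[2] += 1
    return {label: (a / n, g / n) for label, (a, g, n) in sums.items()}


def score(centroids, features):
    """Score a new record: return the outcome whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical new students: (name, study mode, features).
new_students = [
    ("A", "full-time", (0.92, 0.80)),
    ("B", "part-time", (0.45, 0.55)),
    ("C", "part-time", (0.50, 0.50)),
    ("D", "full-time", (0.38, 0.58)),
]

centroids = fit_centroids(training)
predictions = {name: score(centroids, feats) for name, _, feats in new_students}

# OLAP-like step: count predicted outcomes along the study-mode dimension.
outcome_by_mode = Counter(
    (mode, predictions[name]) for name, mode, _ in new_students
)

print(predictions)
print(dict(outcome_by_mode))
```

The model supplies the prediction for each student, while the grouped counts are the kind of summary an analyst might explore further to see what separates the groups.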
The Limits of Data Mining
Data mining is useful for finding trends and correlations in a large data set. Nonetheless, it is of little use without a proper understanding of the business objectives, the business domain, and statistical analysis. Data mining can only identify what is important, not why it is important or how to act on it. It might be useful in determining who is most likely to excel in your study program, but if a student is not committed, the prediction alone cannot make that student excel.