This is a series of articles, tutorials and guides on Data Mining and Warehousing (DM&W). People who are starting out in Data Science should find the topics easy to follow. Students of B.Tech and other engineering courses can use the articles as notes that cover the syllabus of most DM&W courses. Professionals and enthusiasts can learn or brush up on the basics.
No matter what you intend to ‘mine’ from this series, I present all the concepts in a concise and conversational manner.
Without further delay, let’s dive straight in.
What is Data Mining?
Data Mining is the process of discovering interesting patterns from massive amounts of data. Raw data on its own is of little use until we extract or ‘mine’ something useful out of it.
Our aim is to derive knowledge from the data, so Data Mining is often referred to as knowledge discovery from data, or KDD for short.
Data Mining is a growing field with applications in many domains, and it offers a wide range of career opportunities. With the recent boom in Data Science and Machine Learning, there is strong demand for people who know this subject.
What does Knowledge Discovery from Data (KDD) involve?
A knowledge discovery process involves the following –
- Cleaning of Data – removal of noise and inconsistent data.
- Integration of Data – process of combining data from multiple sources.
- Selection of Data – process of retrieving and selecting relevant data for analysis.
- Transformation of Data – consolidation and transformation of data into appropriate forms for mining.
- Pattern Discovery (Data Mining) – application of various methods and algorithms to extract patterns from the processed data.
- Evaluation of Pattern – identifying the truly interesting patterns that represent knowledge, based on some measure of interestingness.
- Knowledge Presentation – visualization and representation of the acquired knowledge.
The first four steps together form a task called Data Preprocessing, in which we prepare the data for mining. Data Mining is therefore not a synonym for Knowledge Discovery but one step within it. The two terms are often used interchangeably, but it helps to understand the broader picture.
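The steps of the knowledge discovery process listed above can be sketched as a tiny pipeline. This is a minimal illustration in plain Python, not a real workflow: the records, the field names (`id`, `stream`, `gpa`), the 8.0 threshold and the `min_count` filter are all made up for the example.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [
    {"id": 1, "stream": "CSE", "gpa": 9.1},
    {"id": 2, "stream": "EECE", "gpa": None},  # noisy record: gpa is missing
]
source_b = [
    {"id": 3, "stream": "CSE", "gpa": 8.4},
    {"id": 4, "stream": "EECE", "gpa": 6.2},
]

def clean(records):
    """Cleaning: drop records with a missing gpa."""
    return [r for r in records if r["gpa"] is not None]

def integrate(*sources):
    """Integration: combine records from several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Selection: keep only the attributes relevant to the analysis."""
    return [{"stream": r["stream"], "gpa": r["gpa"]} for r in records]

def transform(records, threshold=8.0):
    """Transformation: turn the numeric gpa into a 'high'/'low' category."""
    return [
        (r["stream"], "high" if r["gpa"] >= threshold else "low")
        for r in records
    ]

def mine(records):
    """Pattern discovery: count how often each (stream, gpa level) pair occurs."""
    return Counter(records)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

prepared = transform(select(integrate(clean(source_a), clean(source_b))))
patterns = evaluate(mine(prepared))

# Knowledge presentation: report the surviving patterns.
for (stream, level), count in patterns.items():
    print(f"{stream} students with {level} gpa: {count}")
```

The first four functions make up the preprocessing part; only `mine` corresponds to Data Mining in the narrow sense.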
With this definition in mind, we move ahead to the functionalities of Data Mining.
Functionalities of Data Mining
Data Mining can be used for the following functionalities –
Characterization
A summarization of the general characteristics or features of a target class of data.
For example, we can profile the students of a university and produce their characteristics such as their ‘stream (CSE, EECE, etc)’, ‘gpa (high, low)’, ‘year of study’, ‘number of courses taken’, and so on.
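Summarizing a target class in this way can be sketched in a few lines of Python. The student records and the `characterize` helper below are hypothetical and exist only to illustrate the idea.

```python
from collections import Counter
from statistics import mean

# Hypothetical student records; the target class here is "all students".
students = [
    {"stream": "CSE", "gpa": "high", "year": 4, "courses": 32},
    {"stream": "CSE", "gpa": "low", "year": 2, "courses": 14},
    {"stream": "EECE", "gpa": "high", "year": 3, "courses": 24},
    {"stream": "CSE", "gpa": "high", "year": 4, "courses": 30},
]

def characterize(records):
    """Summarize the general characteristics of a set of records."""
    return {
        "count": len(records),
        "streams": Counter(r["stream"] for r in records),
        "gpa_levels": Counter(r["gpa"] for r in records),
        "avg_courses": mean(r["courses"] for r in records),
    }

profile = characterize(students)
print(profile)
```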
Discrimination
A comparison of the general features of the target class data objects against the general features of objects from one or more contrasting classes.
For example, the general features of students with high gpa may be compared against the general features of students with low gpa.
The resulting description could be a general comparative profile of the students such as – ‘75% of students with high gpa are 4th year CSE students, while 65% of the students with low gpa are not 4th year CSE students.’
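A comparative profile like this one boils down to computing the same share for the target class and for the contrasting class. Here is a minimal sketch on invented records; the data and the helper names `share` and `is_fourth_year_cse` are assumptions made for the example, so the percentages differ from the ones quoted above.

```python
# Hypothetical student records; the values are invented for illustration.
students = [
    {"stream": "CSE", "year": 4, "gpa": "high"},
    {"stream": "CSE", "year": 4, "gpa": "high"},
    {"stream": "CSE", "year": 4, "gpa": "high"},
    {"stream": "EECE", "year": 3, "gpa": "high"},
    {"stream": "CSE", "year": 4, "gpa": "low"},
    {"stream": "EECE", "year": 2, "gpa": "low"},
    {"stream": "CSE", "year": 1, "gpa": "low"},
    {"stream": "EECE", "year": 4, "gpa": "low"},
]

def share(records, predicate):
    """Fraction of records that satisfy the predicate."""
    return sum(1 for r in records if predicate(r)) / len(records)

def is_fourth_year_cse(record):
    return record["stream"] == "CSE" and record["year"] == 4

target = [r for r in students if r["gpa"] == "high"]    # target class
contrast = [r for r in students if r["gpa"] == "low"]   # contrasting class

high_share = share(target, is_fourth_year_cse)
low_share = share(contrast, is_fourth_year_cse)

print(f"{high_share:.0%} of high-gpa students are 4th year CSE students")
print(f"{1 - low_share:.0%} of low-gpa students are not 4th year CSE students")
```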
Association Analysis
The discovery of association rules showing attribute-value conditions that frequently occur together in a given set of data. For example,
Rule: major(X, “CSE”) => owns(X, “personal computer”) ; X is some student. Support = 12% and confidence = 98%.
The above rule indicates that 12% (support) of all the students under study both major in CSE and own a personal computer.
Among the students who major in CSE, 98% (confidence or certainty) own a personal computer.
We will discuss concepts of support and confidence in more detail in articles on Association Rule Mining and its algorithms.
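Until then, a small calculation may help fix the two measures. For a rule A => B, support is the fraction of all records that satisfy both A and B, and confidence is the fraction of records satisfying A that also satisfy B. The sketch below computes both on a made-up list of students; the data and the function name `support_and_confidence` are assumptions for illustration.

```python
# Hypothetical records: (major, owns_personal_computer) for each student.
students = [
    ("CSE", True), ("CSE", True), ("CSE", True), ("CSE", False),
    ("EECE", True), ("EECE", False), ("MECH", False), ("MECH", True),
    ("EECE", False), ("MECH", False),
]

def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    with_antecedent = [r for r in records if antecedent(r)]
    with_both = [r for r in with_antecedent if consequent(r)]
    # Support: share of ALL records that satisfy both sides of the rule.
    support = len(with_both) / len(records)
    # Confidence: share of antecedent records that also satisfy the consequent.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Rule: major(X, "CSE") => owns(X, "personal computer")
support, confidence = support_and_confidence(
    students,
    antecedent=lambda r: r[0] == "CSE",
    consequent=lambda r: r[1],
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

In this toy data 3 of the 10 students are CSE majors who own a computer, and 3 of the 4 CSE majors own one.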
Classification
Classification is used for predicting the class label of data objects. It is not the same as prediction, though.
‘Classification’ builds a set of models or functions that describe and distinguish data classes or concepts.
‘Prediction’ builds a model to predict some missing or unavailable data value.
The difference is in the names themselves: Classification deals with class labels, whereas Prediction deals with missing or unavailable data values, which are typically numeric.
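The contrast can be made concrete with two toy models trained on the same input. In this sketch the classifier is a 1-nearest-neighbour rule and the predictor is a least-squares line; both techniques, the data, and the function names are my own choices for illustration, and the missing value is assumed to be numeric.

```python
# Hypothetical training data: hours studied per week for five students.
hours = [2, 4, 6, 8, 10]
labels = ["low", "low", "high", "high", "high"]  # class labels (categorical)
scores = [5.0, 6.0, 7.0, 8.0, 9.0]               # gpa values (numeric)

def classify(x):
    """Classification: return the class label of the nearest training point."""
    nearest = min(range(len(hours)), key=lambda i: abs(hours[i] - x))
    return labels[nearest]

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(hours, scores)

def predict(x):
    """Prediction: estimate a missing numeric value from the fitted line."""
    return slope * x + intercept

print(classify(9))  # the output is a class label
print(predict(7))   # the output is a number
```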
Clustering
Clustering is the analysis of data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing intra-class similarity and minimizing inter-class similarity.
In other words, items within a cluster must be similar to each other, but dissimilar to items in other clusters. Each cluster formed can be viewed as a class of objects in itself.
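To see this grouping in action, here is a very small k-means sketch for one-dimensional data. k-means is just one of many clustering algorithms, picked here for brevity; the gpa values, the starting centers, and the function name `k_means_1d` are assumptions for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data with given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # the centers stopped moving, so the grouping is stable
        centers = new_centers
    return clusters, centers

# Hypothetical gpa values with no class labels attached.
gpas = [5.1, 5.4, 5.8, 8.7, 9.0, 9.3]
clusters, centers = k_means_1d(gpas, centers=[5.0, 6.0])
print(clusters)
```

No labels are given to the algorithm; the two groups emerge only from how close the values are to each other.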
These are the 5 main functionalities of Data Mining. Each can be divided into further tasks, but this article only presents an overview.
I hope you found the article helpful and that the different terms are now clear.
You can also read this book – Data Mining: Concepts and Techniques by Jiawei Han, Jian Pei and Micheline Kamber.
As always, let me know in the comment section below if you have any questions.