The first lecture on Knowledge Discovery in Databases (KDD) is about motivation and introducing the different concepts:
- Why KDD?
- Application examples
- What is KDD?
- What is Data Mining (DM)?
- Supervised vs unsupervised learning
- Main DM tasks
Question 1 is a very easy one. There is so much data nowadays (quantity), in so many different forms such as text, audio, images and video (complexity), that it is impossible to exploit it manually. So we need methods that help us extract "knowledge" from this kind of data in a (semi-)automatic way. This is where Knowledge Discovery in Databases (KDD) and Data Mining (DM) come in.
The retail industry, telecommunications, health care systems, the WWW, science and business are some of the areas where KDD is applied (Question 2).
KDD is a process (Question 3), that is, it consists of several steps (see the picture below):
- Selection of data which are relevant to the analysis task
- Preprocessing of these data, including tasks like data cleaning and data integration
- Transformation of the data into forms appropriate for mining
- Application of Data Mining algorithms for the extraction of patterns
- Interpretation/evaluation of the generated patterns so as to identify those patterns that represent real knowledge, based on some interestingness measures.
If, at some step, you notice an error or "strange" results, you can go back to a previous step in the process and revise it (hence the feedback loops in the picture).
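The steps above can be sketched as a chain of small functions. This is only a toy illustration: the records, the field names, the function names and the "pattern" (a count of values below and above the mean) are all made up for this example, and each step is far simpler than what a real project would need.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 34, "income": 52000, "note": "ok"},
    {"id": 2, "age": None, "income": 48000, "note": "call back"},
    {"id": 3, "age": 51, "income": 91000, "note": ""},
    {"id": 4, "age": 29, "income": 39000, "note": "ok"},
]

def select(records, fields):
    """Selection: keep only the attributes relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def preprocess(records):
    """Preprocessing: a simple cleaning step that drops incomplete records."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, field):
    """Transformation: rescale one attribute to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mine(values):
    """Data mining: a trivial 'pattern', the split of the values around their mean."""
    m = mean(values)
    return {
        "below_mean": sum(v < m for v in values),
        "at_or_above_mean": sum(v >= m for v in values),
    }

def evaluate(pattern, min_support=1):
    """Interpretation/evaluation: keep only groups with enough members."""
    return {k: v for k, v in pattern.items() if v >= min_support}

selected = select(raw, ["age", "income"])
cleaned = preprocess(selected)
scaled = transform(cleaned, "income")
pattern = evaluate(mine(scaled))
print(pattern)
```

If the printed result looks "strange", you would go back and change one of the earlier functions (for example, fill in missing ages instead of dropping the record) and run the chain again.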
The Data Mining (DM) step (Question 4) is one of the core steps of the KDD process. Its goal is to apply data analysis and knowledge discovery algorithms that produce a particular enumeration of patterns (or models) over the data.
Although DM is the most "famous" step, and sometimes the whole procedure is called DM instead of KDD, it relies heavily on the other steps. I mean, if your preprocessing "sucks", the mining results will also "suck".
The results of this step are the patterns, or data mining models.
One can think of patterns as comprehensive summaries of the data or as higher-level descriptions of the data. Different types of patterns exist, such as clusters, decision trees, association rules, frequent itemsets and sequential patterns.
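To make the idea of a pattern as a summary more concrete, here is a minimal sketch that finds frequent item pairs (a tiny special case of frequent itemsets). The transactions, the function name `frequent_pairs` and the threshold `min_count` are assumptions made for this illustration, and the brute-force counting is not how real frequent itemset algorithms work on large data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]

def frequent_pairs(baskets, min_count=2):
    """Count every item pair and keep those found in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

print(frequent_pairs(transactions))
```

The two surviving pairs describe the four baskets at a higher level than the raw transactions do, which is what is meant by a pattern here.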
(Question 5) There are two ways of learning from data: the supervised way and the unsupervised way.
In supervised learning, we are given some examples/instances of the problem for which the true classes/labels are known.
For example, in a medical application you have a set of people for whom you know whether they have some specific disease (class: yes) or not (class: no). The goal is to build a model for each of the different classes so that, for any new instance whose class you have no clue about, you can predict its class.
In our example, for a new person the model should decide whether he/she suffers from the specific disease (class: yes) or not (class: no).
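A minimal sketch of the supervised setting, assuming a 1-nearest-neighbor rule as the "model" (one simple choice among many classification methods). The two features, their values and the function name `predict` are invented for illustration and have no medical meaning.

```python
import math

# Hypothetical labeled instances:
# (body temperature in °C, white blood cell count in 10^9/L) -> class
training = [
    ((36.6, 6.0), "no"),
    ((36.9, 7.2), "no"),
    ((38.7, 13.5), "yes"),
    ((39.1, 15.0), "yes"),
]

def predict(instance, labeled):
    """Return the class of the closest labeled instance (1-nearest neighbor)."""
    nearest = min(labeled, key=lambda pair: math.dist(instance, pair[0]))
    return nearest[1]

new_person = (38.9, 14.1)  # class unknown
print(predict(new_person, training))
```

The key point is that the known classes of the training instances drive the prediction for the new person.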
In unsupervised learning, there is no a priori knowledge in the instances about the right answer. Based on the instance characteristics, the instances are organized into groups of similar instances.
For example, in a news aggregation site like Google News, the news posts are organized into groups of posts referring to the same topic. To do so, the similarity between the news posts is employed.
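A minimal sketch of the unsupervised setting, assuming word overlap (Jaccard similarity) as the similarity measure and a simple greedy grouping rule. The headlines, the threshold and the function names are made up for this illustration, and this is not a description of how any real news site groups its posts.

```python
def jaccard(a, b):
    """Similarity of two word sets: shared words divided by all distinct words."""
    return len(a & b) / len(a | b)

def group_posts(posts, threshold=0.3):
    """Greedy grouping: add each post to the first group whose first post is similar enough."""
    word_sets = [set(p.lower().split()) for p in posts]
    groups = []  # each group is a list of post indices
    for i, words in enumerate(word_sets):
        for group in groups:
            if jaccard(words, word_sets[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([i])
    return groups

# Hypothetical headlines; note that no labels are given.
posts = [
    "central bank raises interest rates",
    "interest rates rise as central bank acts",
    "local team wins football final",
    "football final won by local team",
]
print(group_posts(posts))
```

No topic labels are supplied anywhere; the groups emerge only from the similarity between the posts.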
(Question 6) The main DM tasks are:
- Clustering
- Classification
- Association rule mining
- Regression
- Outlier detection
We will explain each of them in detail in later posts.