Sunday, April 29, 2012

1. Introduction to KDD and DM

The first lecture on Knowledge Discovery in Databases (KDD) covers the motivation and introduces the main concepts:
  1. Why KDD?
  2. Application examples
  3. What is KDD?
  4. What is Data Mining (DM)?
  5. Supervised vs unsupervised learning
  6. Main DM tasks

Question 1 is a very easy one. There is so much data nowadays (quantity), and in so many different forms such as text, audio, images, video, etc. (complexity), that it is impossible to exploit it manually. So we need methods that help us extract "knowledge" from this kind of data in a (semi-)automatic way. This is where Knowledge Discovery in Databases (KDD) and Data Mining (DM) come in.


The retail industry, telecommunications, health care systems, the WWW, science and business are some of the areas where KDD is applied (Question 2).

KDD is a process (Question 3), that is, it consists of several steps (see the picture below):
  1. Selection of data which are relevant to the analysis task
  2. Preprocessing of these data, including tasks like data cleaning and data integration
  3. Transformation of the data into forms appropriate for mining
  4. Application of Data Mining algorithms for the extraction of patterns
  5. Interpretation/evaluation of the generated patterns so as to identify those patterns that represent real knowledge, based on some interestingness measures.


If, at some step, you notice an error or "strange" results, you can go back to a previous step in the process and revise it (hence the feedback loops in the picture).
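
To make the five steps a bit more concrete, here is a minimal sketch in Python. Everything in it is my own assumption for illustration: the toy purchase records, the function names (select, preprocess, transform, mine, evaluate) and the choice of "support" as the interestingness measure. Real KDD projects are of course much messier, but the shape of the process is the same.

```python
from collections import Counter

# Hypothetical raw data: one record per purchased item (illustration only).
raw_records = [
    {"customer": "c1", "item": "Bread", "source": "shop"},
    {"customer": "c1", "item": "milk ", "source": "shop"},
    {"customer": "c2", "item": "bread", "source": "shop"},
    {"customer": "c2", "item": None, "source": "shop"},
    {"customer": "c3", "item": "bread", "source": "web"},
    {"customer": "c3", "item": "milk", "source": "web"},
    {"customer": "c4", "item": "beer", "source": "newsletter"},
]

def select(records):
    # Step 1: keep only the records relevant to the task (actual purchases).
    return [r for r in records if r["source"] in {"shop", "web"}]

def preprocess(records):
    # Step 2: data cleaning, i.e. drop missing items and normalise the spelling.
    return [{**r, "item": r["item"].strip().lower()} for r in records if r["item"]]

def transform(records):
    # Step 3: turn the records into one basket (set of items) per customer.
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    # Step 4: a very simple "mining" step that counts baskets per item.
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support=0.5):
    # Step 5: keep only patterns whose support reaches the threshold.
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def run_kdd(records, min_support=0.5):
    baskets = transform(preprocess(select(records)))
    return evaluate(mine(baskets), len(baskets), min_support)

result = run_kdd(raw_records)
print(result)
```

If the output looks "strange", you would go back and revise one of the earlier functions, for example the cleaning rules in preprocess or the threshold in evaluate, and run the process again.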

The KDD process was introduced in the following paper:

Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth (1996), “From Data Mining to Knowledge Discovery: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

The Data Mining (DM) step (Question 4) is one of the core steps of the KDD process. Its goal is to apply data analysis and knowledge discovery algorithms that produce a particular enumeration of patterns (or models) over the data.
Although DM is the most "famous" step, and sometimes the whole procedure is called DM instead of KDD, it relies heavily on the other steps. I mean, if your preprocessing "sucks", the mining results will also "suck".

The results of this step are the patterns, or data mining models.
One can think of patterns as comprehensive summaries of the data or as higher-level descriptions of the data. Different types of patterns exist, such as clusters, decision trees, association rules, frequent itemsets and sequential patterns.
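
As a small illustration of what such a pattern can look like, the sketch below counts frequent itemsets in a handful of made-up transactions and derives the confidence of one association rule from them. The transactions, the function name frequent_itemsets and the min_count threshold are assumptions of mine; this is a brute-force count, not an efficient algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping transactions (illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def frequent_itemsets(transactions, size, min_count):
    # Count every itemset of the given size and keep the frequent ones.
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

singles = frequent_itemsets(transactions, size=1, min_count=2)
pairs = frequent_itemsets(transactions, size=2, min_count=2)

# Association rule "bread -> milk": of the transactions that contain bread,
# which fraction also contains milk?
confidence = pairs[("bread", "milk")] / singles[("bread",)]
print(pairs, confidence)
```

The frequent pairs summarise four transactions in two lines, which is exactly the "higher-level description" idea.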

(Question 5) There are two ways of learning from data: the supervised way and the unsupervised way.
  • Supervised learning: 
we are given some examples/instances of the problem for which the true classes/labels are known. 
For example, in a medical application you have a set of people for whom you know whether they have some specific disease (class: yes) or not (class: no). The goal is to build a model of the different classes so that, for any new instance whose class is unknown, the model can predict its class.
In our example, for a new person the model should decide whether he/she suffers from the specific disease (class: yes) or not (class: no).
  • Unsupervised learning: 
there is no a priori knowledge about the right answer for the instances. Based on their characteristics, the instances are organized into groups of similar instances.
For example, on a news aggregation site like Google News, the news posts are organized into groups of posts referring to the same topic. To do so, the similarity between the news posts is used.
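
The difference between the two ways can be sketched in a few lines of Python. Both halves are toy examples under my own assumptions: the patient features and labels are invented, the classifier is a plain 1-nearest-neighbour rule, and the news posts are grouped with a simple greedy pass using Jaccard similarity on words. Real systems use far richer models, so treat this only as an illustration of "labels given" versus "no labels given".

```python
import math

# --- Supervised: the labels (yes/no) are given with the examples. ---
# Hypothetical features: (body temperature, some lab marker).
labelled_patients = [
    ((36.8, 1.0), "no"),
    ((37.0, 1.2), "no"),
    ((39.1, 4.8), "yes"),
    ((38.7, 5.1), "yes"),
]

def predict(features, training_data):
    # 1-nearest neighbour: return the label of the closest known example.
    nearest = min(training_data, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]

# --- Unsupervised: no labels, only similarity between instances. ---
posts = [
    "election results announced tonight",
    "final election results tonight",
    "new phone released today",
    "new phone model released",
]

def jaccard(a, b):
    # Share of words two posts have in common.
    return len(a & b) / len(a | b)

def group_posts(posts, threshold=0.4):
    # Greedy grouping: put a post into the first group whose first post
    # is similar enough, otherwise start a new group.
    groups = []
    for post in posts:
        words = set(post.split())
        for group in groups:
            if jaccard(words, set(group[0].split())) >= threshold:
                group.append(post)
                break
        else:
            groups.append([post])
    return groups

print(predict((38.9, 4.5), labelled_patients))
print(group_posts(posts))
```

Note that predict needs the known classes to work at all, while group_posts never sees a label; it only compares the posts with each other.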

(Question 6) The main DM tasks are:
  • Clustering
  • Classification
  • Association rule mining
  • Regression
  • Outlier detection
We will explain each of them in detail in later posts. 

Hello again

My first hello post was back in 2009. 
At that time, I wanted to blog about data mining related issues, but for some reason I lost my motivation quite fast :-(
I decided to restart now. I am giving a data mining lecture this semester, so I think it's a good motivation to restart.