
Sunday, November 3, 2013

2. Data Preprocessing

Before applying any mining algorithm, one should prepare the data so that they are suitable for the algorithm. This involves transforming the data into a format the algorithm can handle, and also dealing with problems in the data such as noise and outliers.
In many applications the raw data are problematic, that is, their quality is low. There are different reasons for this low quality, e.g.:
  • Noisy data: e.g. an age value of 1000 is not valid. 
  • Incomplete data: missing values or even missing attributes that are important for the analysis 
  • Inconsistent data: e.g. different movie ratings between different databases; one might use the 1-5 range and the other the 1-10 range. If we want to analyze data from both systems, we should find some kind of mapping between the two ranges (a small sketch of such a mapping follows this list).
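As a toy illustration of this value-resolution problem, here is a minimal Python sketch that linearly maps ratings from a 1-10 scale onto a 1-5 scale. The function name and the concrete ranges are mine, chosen for illustration:

```python
# Hypothetical value-resolution helper: linearly map a rating from
# [old_min, old_max] onto [new_min, new_max]. Names and ranges are
# illustrative, not from any specific system.
def rescale_rating(r, old_min=1, old_max=10, new_min=1, new_max=5):
    return new_min + (r - old_min) * (new_max - new_min) / (old_max - old_min)

print(rescale_rating(1))   # 1.0  (lowest rating maps to lowest)
print(rescale_rating(10))  # 5.0  (highest maps to highest)
print(rescale_rating(7))   # about 3.67
```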
It is obvious that if we rely upon "dirty data", the mining results will be of low quality. The better the quality of the raw data, the better the quality of the mining results. 
So, preprocessing is a very important step in the mining process, and it usually takes a lot of time, depending of course on the application itself and on the quality of the original/raw data.

Major tasks in data preprocessing

  • Data cleaning:  e.g., fill in missing values, remove outliers
  • Data integration: "combine" data from different sources. Issues to be solved: schema matching (which attribute of one source corresponds to which attribute of the other source), entity resolution (which records refer to the same real-world entity), value resolution (like the different-ranges problem described above).
  • Data transformation: e.g. normalize the data into a given range (usually 0-1; a small sketch follows this list), generalize (e.g. from the x,y coordinates level to the city level).
  • Data reduction: aggregation (e.g. instead of 12 monthly salary values for a person, just keep the average monthly salary), dimensionality reduction (remove redundant dimensions, e.g. no need to use both birthday and age in the analysis), duplicate elimination. 
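As a concrete example of the normalization mentioned above, here is a minimal min-max normalization sketch in Python; the sample values are made up:

```python
# Minimal min-max normalization sketch: map values onto the [0, 1]
# range. The sample ages are made up for illustration.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 25, 40, 65]
print(min_max_normalize(ages))  # [0.0, 0.148..., 0.468..., 1.0]
```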

Features

Although there are different types of data (numerical, text, images, videos, graphs, ...), the analysis relies upon features extracted from these data. So, the mining methods are largely generic, since they are applied over features extracted from different application domains.
You can think of the features as the columns/fields of a table in a database. So, the features are the properties/characteristics of the data.

Depending on the application, feature extraction might be a challenging step in itself. For example, there is a lot of work on extracting features from text data (e.g., through TF-IDF) or from images (e.g., color histograms). A toy TF-IDF sketch is given below.
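Here is a minimal sketch of one common TF-IDF weighting variant (several variants exist); the two-document corpus is made up. Note how the term "mining", which occurs in every document, gets weight 0:

```python
# Toy TF-IDF sketch: tf is the relative term frequency in a document,
# idf = log(N / df), where df counts the documents containing the term.
# One of several common weighting variants; the corpus is made up.
import math
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "text mining extracts features from text",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

for doc in tokenized:
    print(tfidf(doc))  # "mining" appears in both documents, so its weight is 0
```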

There are different types of features/attributes:
  • binary, e.g., smoker (yes/no), glasses (yes/no), gender (male/female)
  • categorical or nominal, e.g., eye color (brown, blue, green), occupation (engineer, student, teacher, ...) 
  • ordinal, e.g., medal (bronze, silver, gold)
  • numerical, e.g., age, income, height
Numerical data are the most common. The numerical values might be the original values (e.g. x,y coordinates in a spatial application) or the result of some transformation (e.g. TF-IDF values in text data).

Univariate Feature/Attribute descriptors
There are different measures to "characterize" single attributes. The reason we use these descriptors is to understand the distribution of the attribute values and to clean the data, e.g., by removing outliers.
For numerical features, the most common ones are: mean/average, median, and mode (all measures of the central tendency of the attribute), and range, quantiles, standard deviation, and variance (all measures of the dispersion of the attribute).
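These descriptors are easy to compute with the Python standard library. A minimal sketch with made-up income values; note how the outlier inflates the mean but barely moves the median (statistics.quantiles needs Python 3.8+):

```python
# Univariate descriptors of a toy numerical attribute (values made up).
# The last value is an outlier: it pulls the mean up, the median less so.
import statistics as st

income = [1200, 1500, 1500, 1800, 2100, 9000]

print("mean     :", st.mean(income))                # 2850
print("median   :", st.median(income))              # 1650
print("mode     :", st.mode(income))                # 1500
print("range    :", max(income) - min(income))      # 7800
print("quartiles:", st.quantiles(income, n=4))      # Python 3.8+
print("stdev    :", st.stdev(income))
print("variance :", st.variance(income))
```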

Bivariate Feature/Attribute descriptors
There are different measures to "characterize" the correlation between two attributes. The reason we use these descriptors is to find possibly redundant attributes, so that we do not consider both in the analysis.
For numerical data, the most common measure is the correlation coefficient.
For categorical data, the χ² (chi-square) test is used.
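Both measures are available in SciPy. A minimal sketch, with made-up data for both cases:

```python
# Bivariate descriptors: Pearson correlation for two numerical
# attributes, chi-square for two categorical ones. All data are
# made up for illustration; requires scipy.
from scipy.stats import pearsonr, chi2_contingency

# Numerical vs. numerical: age and income.
age    = [23, 30, 35, 42, 50, 61]
income = [1400, 1900, 2300, 2800, 3100, 3600]
r, p = pearsonr(age, income)
print(f"correlation coefficient r = {r:.3f} (p = {p:.3g})")

# Categorical vs. categorical: contingency table of smoker x disease.
#               disease  no disease
table = [[20,  5],    # smoker
         [10, 65]]    # non-smoker
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f} (p = {p:.3g})")
```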

1. Knowledge Discovery in Databases (KDD) and Data Mining (DM)

More and more data of different types (text, audio, images, videos,...) are collected nowadays from different data sources (telecommunication, science, business, health-care systems, WWW,...).

Due to their quantity and complexity, it is impossible for humans to exploit these data collections through some manual process. Here comes the role of Knowledge Discovery in Databases (KDD), which aims at discovering knowledge hidden in vast amounts of data.

The KDD process consists of the following steps (see the picture below):
  1. Selection of data which are relevant to the analysis task
  2. Preprocessing of these data, including tasks like data cleaning and data integration
  3. Transformation of the data into forms appropriate for mining
  4. Application of Data Mining algorithms for the extraction of patterns
  5. Interpretation/evaluation of the generated patterns so as to identify those patterns that represent real knowledge, based on some interestingness measures.
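Conceptually, the process chains these steps together, the output of each step feeding the next. A schematic sketch; every function here is a hypothetical placeholder standing for a whole step, not a real API, and the stubs only pass data through so that the sketch runs:

```python
# Schematic sketch of the KDD process as a chain of steps.
# All functions are hypothetical placeholders for illustration.
def select(raw):   return [r for r in raw if r is not None]  # 1. selection
def preprocess(d): return sorted(set(d))                     # 2. cleaning/integration
def transform(d):  return [[x] for x in d]                   # 3. transformation
def mine(d):       return {"n_instances": len(d)}            # 4. data mining
def evaluate(p):   return p                                  # 5. interpretation/evaluation

raw = [3, 1, None, 2, 2]
print(evaluate(mine(transform(preprocess(select(raw))))))    # {'n_instances': 3}
```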


The Data Mining (DM) step is one of the core steps of the KDD process. Its goal is to apply data analysis and knowledge discovery algorithms that produce a particular enumeration of patterns (or models) over the data.

The KDD process was introduced in the following paper:

Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth (1996), "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

What do we mean by the term patterns or data mining models?
One can think of patterns as concise summaries of the data or as a higher-level description of the data. Different types of patterns exist, like clusters, decision trees, association rules, frequent itemsets, and sequential patterns.

There are a lot of informative resources on DM and KDD that one can use for further reading.

Thursday, August 23, 2012

Free online data mining books

For the interested reader, there are several data mining books available online for free. One of them is Mining of Massive Datasets by Rajaraman, Leskovec and Ullman. The emphasis in this book is on mining very large amounts of data, i.e., data that do not fit in main memory. The WWW is a source of such data and is used quite often as an example in the book.

Below is the table of contents:
  1. Data Mining
  2. Large-Scale File Systems and Map-Reduce
  3. Finding Similar Items
  4. Mining Data Streams
  5. Link Analysis
  6. Frequent Itemsets
  7. Clustering
  8. Advertising on the Web
  9. Recommendation Systems
  10. Mining Social-Network Graphs

Sunday, April 29, 2012

1. Introduction to KDD and DM

The first lecture on Knowledge Discovery in Databases (KDD) covers the motivation and introduces the basic concepts: 
  1. Why KDD?
  2. Application examples
  3. What is KDD?
  4. What is Data Mining (DM)?
  5. Supervised vs unsupervised learning
  6. Main DM tasks

Question 1 is a very easy one. So much data is collected nowadays (quantity), and in so many different forms, like text, audio, images, video, etc. (complexity), that it is impossible to exploit it manually. So, methods are required that help us extract "knowledge" from such data in a (semi-)automatic way. Here comes the role of Knowledge Discovery in Databases (KDD) and Data Mining (DM).


The retail industry, telecommunications, health-care systems, the WWW, science, and business are some examples of domains where KDD is applied (Question 2).

KDD is a process (Question 3), that is, it consists of several steps (see the picture below):
  1. Selection of data which are relevant to the analysis task
  2. Preprocessing of these data, including tasks like data cleaning and data integration
  3. Transformation of the data into forms appropriate for mining
  4. Application of Data Mining algorithms for the extraction of patterns
  5. Interpretation/evaluation of the generated patterns so as to identify those patterns that represent real knowledge, based on some interestingness measures.


If, at some step, you notice errors or "strange" results, you can go back to a previous step in the process and revise it (hence the feedback loops in the picture). 

The KDD process was introduced in the following paper:

Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth (1996), "From Data Mining to Knowledge Discovery: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

The Data Mining (DM) step (Question 4) is one of the core steps of the KDD process. Its goal is to apply data analysis and knowledge discovery algorithms that produce a particular enumeration of patterns (or models) over the data. 
Although DM is the most "famous" step, and sometimes the whole procedure is called DM instead of KDD, it heavily relies on the other steps: if your preprocessing "sucks", the mining results will also "suck".

The results of this step are the patterns, or data mining models.
One can think of patterns as concise summaries of the data or as a higher-level description of the data. Different types of patterns exist, like clusters, decision trees, association rules, frequent itemsets, and sequential patterns.

(Question 5) There are two ways of learning from data: the supervised way and the unsupervised way. A small sketch contrasting the two follows the bullets below.
  • Supervised learning: 
we are given some examples/instances of the problem for which the true classes/labels are known. 
For example, in a medical application you have a set of people for whom you know whether they have some specific disease (class: yes) or not (class: no). The goal is to build a model of the different classes so that, for a new instance whose class is unknown, the model can predict it.
In our example, for a new person the model should decide whether he/she suffers from the specific disease (class: yes) or not (class: no).
  • Unsupervised learning: 
there is no a priori knowledge of the right answer for the instances. Based on their characteristics, the instances are organized into groups of similar instances.
For example, in a news aggregation site like Google News, the news posts are organized into groups of posts referring to the same topic. To do so, the similarity between the news posts is employed.
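Here is the promised sketch, using scikit-learn; the tiny datasets and the feature choices are made up for illustration:

```python
# Minimal contrast of supervised vs. unsupervised learning with
# scikit-learn. Datasets and features are made up for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Supervised: labeled instances [age, blood pressure] -> disease yes/no.
X_train = [[45, 180], [50, 195], [30, 120], [25, 115]]
y_train = [1, 1, 0, 0]  # 1 = has the disease, 0 = does not
clf = DecisionTreeClassifier().fit(X_train, y_train)
print("predicted class:", clf.predict([[48, 185]]))  # likely [1]

# Unsupervised: no labels; similar instances are grouped together.
X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels:", km.labels_)  # two clusters of two points each
```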

(Question 6) The main DM tasks are:
  • Clustering
  • Classification
  • Association rules mining
  • Regression
  • Outlier detection
We will explain each of them in detail in later posts.