Sunday, November 3, 2013

2. Data Preprocessing

Before applying any mining algorithm, one should prepare the data so that it is suitable for the algorithm. This involves transforming the data into a format the algorithm can handle, and also dealing with problems in the data such as noise and outliers.
In many applications the raw data are problematic, that is, their quality is low. There are different reasons for this low quality, e.g.:
  • Noisy data: e.g., an age value of 1000 is not valid.
  • Incomplete data: missing values, or even entirely missing attributes that are important for the analysis.
  • Inconsistent data: e.g., different movie rating scales between databases, where one uses the 1-5 range and the other the 1-10 range. If we want to analyze data from both systems, we should find some kind of mapping between the two ranges (see the sketch below).
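For the rating example, the mapping can be a simple linear rescaling; here is a minimal Python sketch (the scale endpoints are the only assumptions):

```python
def rescale(value, old_min, old_max, new_min, new_max):
    """Linearly map a value from [old_min, old_max] to [new_min, new_max]."""
    ratio = (value - old_min) / (old_max - old_min)
    return new_min + ratio * (new_max - new_min)

# A rating of 4 on the 1-5 scale corresponds to 7.75 on the 1-10 scale.
print(rescale(4, 1, 5, 1, 10))
```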
It is obvious that if we rely upon "dirty" data, the mining results will be of low quality: the better the quality of the raw data, the better the quality of the mining results.
So, preprocessing is a very important step in the mining process, and it usually takes a lot of time, depending of course on the application itself and on the quality of the original/raw data.

Major tasks in data preprocessing

  • Data cleaning: e.g., fill in missing values, remove outliers.
  • Data integration: "combine" data from different sources. Issues to be solved: entity/schema resolution (deciding which feature or record of one source maps to which feature or record of the other source), value resolution (like the different-ranges problem described above).
  • Data transformation: e.g., normalize the data into a given range (usually 0-1; see the sketch after this list), generalize (e.g., from the x,y-coordinates level to the city level).
  • Data reduction: aggregation (e.g., instead of 12 values for the salary attribute of a person, 1 per month, just keep the average monthly salary), dimensionality reduction (remove redundant dimensions, e.g., no need to use both birthday and age for the analysis), duplicate elimination.
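To make the cleaning and transformation tasks concrete, here is a minimal Python sketch (with made-up salary values) that fills missing values with the attribute mean and then min-max normalizes into the 0-1 range:

```python
# Toy attribute with a missing value (None); the values are made up.
salaries = [2100, 2400, None, 1900, 2800]

# Data cleaning: fill missing values with the mean of the observed ones.
observed = [v for v in salaries if v is not None]
mean = sum(observed) / len(observed)
cleaned = [v if v is not None else mean for v in salaries]

# Data transformation: min-max normalization into the [0, 1] range.
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]
print(normalized)
```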

Features

Although there are different types of data (numerical, text, images, videos, graphs, ...), the analysis relies upon features extracted from these data. So, the mining methods are generic in a sense, since they are applied over features that can be extracted from different application domains.
You can think of the features as the columns/fields of a table in a database. So, the features are the properties/characteristics of the data.

Depending on the application, feature extraction might be a challenging step itself. For example, there is a lot of work on extracting features from text data (e.g., through TFIDF) or from images (e.g. color histograms).
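As an illustration, here is a minimal sketch of text feature extraction with scikit-learn's TfidfVectorizer (the three example sentences are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data mining finds patterns in data",
    "preprocessing cleans raw data",
    "patterns summarize the data",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: documents x terms
print(vectorizer.get_feature_names_out())  # the extracted vocabulary/features
print(X.toarray())                         # the TFIDF value of each term per document
```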

There are different types of features/attributes:
  • binary, e.g., smoker (yes/no), glasses (yes/no), gender (male/female)
  • categorical or nominal, e.g., eye color (brown, blue, green), occupation (engineer, student, teacher, ...)
  • ordinal, e.g., medal (bronze, silver, gold)
  • numerical, e.g., age, income, height
Numerical data are the most common. The numerical values might be the original values (e.g. x,y coordinates in a spatial application) or the result of some transformation (e.g. TFIDF values in text data).
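Non-numerical attributes are typically encoded as numbers before mining; here is a minimal sketch, reusing the attribute examples above (the record values are made up):

```python
record = {"smoker": "yes", "eye_color": "green", "medal": "silver", "age": 34}

# Binary attribute: map yes/no to 1/0.
smoker = 1 if record["smoker"] == "yes" else 0

# Ordinal attribute: map to integers that preserve the order.
medal_rank = {"bronze": 1, "silver": 2, "gold": 3}[record["medal"]]

# Categorical/nominal attribute: one-hot encode (no order implied).
eye_colors = ["brown", "blue", "green"]
eye_onehot = [1 if record["eye_color"] == c else 0 for c in eye_colors]

# The record as a purely numerical feature vector.
print([smoker, medal_rank, *eye_onehot, record["age"]])
```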

Univariate Feature/Attribute descriptors
There are different measures to "characterize" single attributes. We use these descriptors to understand the distribution of an attribute's values and to clean the data, e.g., by removing outliers.
For numerical features, the most common ones are:
mean/average, median, and mode (all measures of central tendency of the attribute), and range, quantiles, standard deviation, and variance (all measures of dispersion of the attribute).
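All of these descriptors are available in Python's standard statistics module; here is a small sketch with made-up ages, including the invalid value 1000 from the noise example above:

```python
import statistics

ages = [23, 25, 25, 29, 31, 34, 41, 1000]  # note the outlier: 1000

# Measures of central tendency
print(statistics.mean(ages))    # heavily pulled by the outlier
print(statistics.median(ages))  # robust to the outlier
print(statistics.mode(ages))    # most frequent value: 25

# Measures of dispersion
print(max(ages) - min(ages))            # range
print(statistics.quantiles(ages, n=4))  # quartiles (Python 3.8+)
print(statistics.stdev(ages))           # standard deviation
print(statistics.variance(ages))        # variance
```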

Bivariate Feature/Attribute descriptors
There are different measures to "characterize" the correlation between two attributes. We use these descriptors to detect possibly redundant attributes, so that we do not consider both in the analysis.
For numerical data, the most common measure is the correlation coefficient.
For categorical data, the χ² (chi-square) test is used.
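Here is a minimal sketch of both measures with NumPy/SciPy (the toy values and the contingency table are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Numerical: correlation coefficient between age and (redundant) birth year.
age = np.array([23, 31, 45, 52, 60])
birth_year = 2013 - age
print(np.corrcoef(age, birth_year)[0, 1])  # -1.0: perfectly correlated, keep only one

# Categorical: chi-square test on a contingency table, e.g., smoker vs. glasses.
table = np.array([[30, 10],   # smoker: yes -> glasses yes/no
                  [25, 35]])  # smoker: no  -> glasses yes/no
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value suggests the attributes are not independent
```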

1. Knowledge Discovery in Databases (KDD) and Data Mining (DM)

More and more data of different types (text, audio, images, videos,...) are collected nowadays from different data sources (telecommunication, science, business, health-care systems, WWW,...).

Due to their quantity and complexity, it is impossible for humans to exploit these data collections through some manual process. This is where Knowledge Discovery in Databases (KDD) comes in: it aims at discovering knowledge hidden in vast amounts of data.

The KDD process consists of the following steps (a pipeline sketch in code follows the list):
  1. Selection of data which are relevant to the analysis task
  2. Preprocessing of these data, including tasks like data cleaning and data integration
  3. Transformation of the data into forms appropriate for mining
  4. Application of Data Mining algorithms for the extraction of patterns
  5. Interpretation/evaluation of the generated patterns so as to identify those patterns that represent real knowledge, based on some interestingness measures.
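Here is a toy end-to-end rendering of these five steps in Python; every function is a hypothetical stand-in, just to show how the steps chain together:

```python
def select(raw):         # 1. Selection: keep task-relevant records
    return [r for r in raw if r.get("age") is not None]

def preprocess(data):    # 2. Preprocessing: e.g., drop invalid (noisy) ages
    return [r for r in data if 0 < r["age"] < 120]

def transform(data):     # 3. Transformation: into a mining-ready form
    return [r["age"] for r in data]

def mine(values):        # 4. Data mining: here, a trivial "pattern"
    return [("avg_age", sum(values) / len(values))]

def evaluate(patterns):  # 5. Interpretation: toy interestingness filter
    return [p for p in patterns if p[1] > 0]

raw = [{"age": 25}, {"age": None}, {"age": 1000}, {"age": 40}]
print(evaluate(mine(transform(preprocess(select(raw))))))  # [('avg_age', 32.5)]
```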


The Data Mining (DM) step is one of the core steps of the KDD process.
Its goal is to apply data analysis and knowledge discovery algorithms that
produce a particular enumeration of patterns (or models) over the data.

The KDD process was introduced in the following paper:

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996), "From Data Mining to Knowledge Discovery: An Overview," in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.

What do we mean by the terms patterns or data mining models?
One can think of patterns as concise, higher-level summaries or descriptions of the data. Different types of patterns exist, such as clusters, decision trees, association rules, frequent itemsets, and sequential patterns (a toy example follows below).
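As a toy example of one pattern type, frequent itemsets can be found by simple counting over made-up market-basket transactions:

```python
from itertools import combinations
from collections import Counter

# Made-up transactions; a frequent itemset is one kind of pattern.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]

# Count every 2-item combination and keep those in at least 3 transactions.
counts = Counter(pair for t in transactions
                 for pair in combinations(sorted(t), 2))
print([(pair, c) for pair, c in counts.items() if c >= 3])
```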

There are a lot of informative resources on DM and KDD that one can use for further reading. I list some of them below: