Before applying any mining algorithm, the data must be prepared so that they are suitable for that algorithm. This involves transforming the data into a format the algorithm can work with, and also dealing with problems in the data such as noise and outliers.
In many applications there exist raw data that are problematic, that is, their quality is low. There are different reasons for this low quality, e.g.:
- Noisy data: e.g. an age value of 1000 is not valid.
- Incomplete data: missing values, or even whole attributes that are important for the analysis but were never recorded.
- Inconsistent data: e.g. movie ratings that use different scales in different databases; one might use the 1-5 range and the other the 1-10 range. If we want to analyze data from both systems, we need a mapping between the two ranges.
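The rating example can be made concrete with a small sketch. It assumes a simple linear mapping between the two ranges, which is only one possible choice; the function name `rescale` and the sample ratings are made up for illustration.

```python
def rescale(value, src_min, src_max, dst_min, dst_max):
    """Linearly map value from [src_min, src_max] to [dst_min, dst_max]."""
    if src_max == src_min:
        raise ValueError("source range must have non-zero width")
    # Position of the value inside the source range, between 0 and 1.
    fraction = (value - src_min) / (src_max - src_min)
    return dst_min + fraction * (dst_max - dst_min)


# Hypothetical ratings from the system that uses the 1-5 range.
ratings_1_to_5 = [1, 3, 5]

# The same ratings expressed on the 1-10 range of the other system.
ratings_on_1_to_10 = [rescale(r, 1, 5, 1, 10) for r in ratings_1_to_5]
print(ratings_on_1_to_10)  # [1.0, 5.5, 10.0]
```

A linear mapping keeps the endpoints and the relative distances between ratings, but it cannot tell whether users of the two systems actually use the scales in the same way.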
It is obvious that if we rely upon "dirty data", the mining results will be of low quality. The better the quality of the raw data, the better the quality of the mining results.
Preprocessing is therefore a very important step in the mining process. It usually takes a lot of time, depending on the application and on the quality of the original raw data.
Major tasks in data preprocessing
- Data cleaning: e.g., fill in missing values, remove outliers
- Data integration: "combine" data from different sources. Issues to be solved: entity resolution (which feature of one source maps to which feature of the other source), value resolution (like the different ranges problem described above).
- Data transformation: e.g. normalize the data to a given range (usually 0-1), generalize (e.g. from the level of x,y coordinates to the city level).
- Data reduction: aggregation (e.g. instead of keeping 12 values for the salary attribute of a person (one per month), keep just the average monthly salary), dimensionality reduction (remove redundant dimensions, e.g. there is no need to use both birthday and age for the analysis), duplicate elimination.
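Some of the tasks above can be sketched in a few lines of plain Python. This is an illustration under simple assumptions, not a recommended pipeline: missing values are marked with `None` and filled with the mean of the observed values, normalization is min-max scaling to the 0-1 range, and all sample values are invented.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical age attribute with one missing value.
ages = [20, None, 30, 40]
cleaned = fill_missing(ages)
normalized = min_max_normalize(cleaned)
print(cleaned)     # [20, 30, 30, 40]
print(normalized)  # [0.0, 0.5, 0.5, 1.0]

# Data reduction by aggregation: 12 monthly salaries become one average.
monthly_salary = [3000] * 6 + [3200] * 6
avg_salary = mean(monthly_salary)
print(avg_salary)  # 3100
```

Filling with the mean is only one option; depending on the application, the median, a default value, or dropping the incomplete records might be more appropriate.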
Features
Although there are different types of data (numerical, text, images, videos, graphs, ...), the analysis relies upon features extracted from these data. The mining methods are therefore generic: they are applied over features, whatever the application domain the features were extracted from.
You can think of the features as the columns/fields of a table in a database. So, the features are the properties/ characteristics of the data.
Depending on the application, feature extraction might be a challenging step itself. For example, there is a lot of work on extracting features from text data (e.g., through TFIDF) or from images (e.g. color histograms).
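As an illustration of feature extraction from text, the sketch below computes TFIDF weights for a tiny, invented document collection. It assumes one common variant (term frequency as count divided by document length, inverse document frequency as log(N / df)) and whitespace tokenization; other variants and smoothing schemes exist, and the function name `tf_idf` is chosen for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # tf = count / document length, idf = log(N / df)
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["data mining finds patterns", "data cleaning removes noise"]
weights = tf_idf(docs)

# "data" occurs in every document, so its weight is 0;
# "mining" occurs only in the first document and gets a positive weight.
print(weights[0]["data"], weights[0]["mining"])
```

Each document becomes a numerical feature vector indexed by terms, so the same mining methods that work on other numerical features can be applied to text.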
There are different types of features/attributes:
- binary, e.g., smoker (yes/no), glasses (yes/no), gender (male/female)
- categorical or nominal, e.g., eye color (brown, blue, green), occupation (engineer, student, teacher,...)
- ordinal, e.g. medal(bronze, silver, gold)
- numerical, e.g. age, income, height
Numerical data are the most common. The numerical values might be the original values (e.g. x,y coordinates in a spatial application) or the result of some transformation (e.g. TFIDF values in text data).
Univariate Feature/Attribute descriptors
There are different measures to "characterize" single attributes. The reason we use these descriptors is to understand the distribution of the attribute values and clean the data by e.g. removing outliers.
For numerical features, the most common ones are:
- measures of central tendency of the attribute: mean (average), median, mode
- measures of dispersion of the attribute: range, quantiles, standard deviation, variance
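These descriptors are available in Python's standard `statistics` module. The sketch below uses an invented age attribute that contains the noisy value 1000. The outlier check at the end uses the 1.5 x IQR rule of thumb, which is an assumption of this example and only one of several possible criteria.

```python
import statistics

# Hypothetical age attribute; 1000 is a noisy value.
ages = [23, 25, 25, 31, 38, 41, 1000]

# Measures of central tendency.
mean_age = statistics.mean(ages)      # 169, pulled up by the noisy value
median_age = statistics.median(ages)  # 31, robust to the noisy value
mode_age = statistics.mode(ages)      # 25

# Measures of dispersion (population versions of variance and std. deviation).
value_range = max(ages) - min(ages)   # 977
variance = statistics.pvariance(ages)
std_dev = statistics.pstdev(ages)
q1, q2, q3 = statistics.quantiles(ages, n=4)  # quartiles

# Flag values far outside the interquartile range (1.5 x IQR rule of thumb).
iqr = q3 - q1
outliers = [a for a in ages if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]
print(outliers)  # [1000]
```

The gap between the mean and the median already hints that the distribution is distorted by a few extreme values.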
Bivariate Feature/Attribute descriptors
There are different measures to "characterize" the correlation between two attributes. The reason we use these descriptors is to find possibly redundant attributes, so that only one of them needs to be considered in the analysis.
For numerical data, the most common measure is the correlation coefficient.
For categorical data, it is the χ^2 (chi-square) test.
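Both measures can be sketched in plain Python. The example assumes the Pearson correlation coefficient for the numerical case and the chi-square statistic of a contingency table for the categorical case. The data, the reference year 2024 and the function names are invented for illustration, and turning the statistic into a p-value would additionally require the chi-square distribution, which is not shown here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chi_square(table):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Numerical case: age is fully determined by birth year, so one is redundant.
birth_year = [1990, 1985, 2000, 1975]
age = [2024 - y for y in birth_year]  # 2024 is an assumed reference year
r = pearson(birth_year, age)
print(round(r, 6))  # -1.0

# Categorical case: hypothetical counts for two binary attributes
# (rows: smoker yes/no, columns: glasses yes/no).
table = [[30, 10], [10, 30]]
chi2 = chi_square(table)
print(chi2)  # 20.0
```

A correlation coefficient close to 1 or -1, or a large chi-square statistic, suggests that the two attributes carry largely the same information.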