Sunday, May 24, 2015

Recommendations

Today's post is about recommendations, which seem to be everywhere nowadays, from online shopping stores to news, movie and music sites. They became so popular because of the vast amount of choices offered out there; so many clothes to wear, so many books to read, so many movies to watch. Going through all these choices one by one is impossible for the end user.

A solution for the user would be to filter the offered items based on criteria like the author in the case of a book, or the director in the case of a film. Filtering is one way to restrict the selection to fewer items, but the quality of the results really depends on the filtering itself. Bad filtering might result in the user missing some interesting items. Moreover, filtering does not consider user preferences as "captured" by his/her previous "behavior". Finally, filtering requires a lot of effort from the user side, as the user has to specify the filtering criteria.

Recommendations help in this direction in the sense that they exploit the customer's "past behavior" in order to "infer" what he/she might be interested in currently. Inference is the key factor here: a recommender tries to guess, indirectly from the "past behavior", what a user might like. In that sense, the user involvement is limited.

Before proceeding with how recommendations are issued, let's introduce a more formal problem description. The best way to "visualize" the problem is by considering a matrix R with |U| rows and |I| columns. The rows represent the users, whereas the columns represent the items of our recommendation application. Each cell r_{ij} in this matrix indicates the rating of the corresponding user u_i \in U for the item i_j \in I. Obviously, not all items are rated by users; the corresponding entries in the matrix are null. Actually, most of the entries are null, because typically there are many, many items in a recommendation application and users rate only a few of them. This is known as the sparsity problem in recommendations.
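As a toy illustration (the numbers are made up), consider 3 users and 4 items rated on a 1-5 scale, with "?" marking the null (unrated) entries:

R = \begin{pmatrix} 5 & ? & 3 & ? \\ ? & 4 & ? & 2 \\ 1 & ? & ? & 5 \end{pmatrix}

Even in this tiny example, half of the entries are missing; in real applications the fraction of null entries is much higher.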

There is a large variety of recommendation approaches, from collaborative filtering to user models. We will now focus on collaborative filtering as a way of inferring recommendations for a user.

Collaborative filtering

In collaborative filtering, the recommendations to a query user u are based on other, similar users. These are users who exhibit similarity to the query user w.r.t. their purchases/preferences. The idea is that if I, as a user, have a similar purchase behavior to you, then items that you have bought but I have not (so far) can be of potential interest to me.

So the critical part is how to locate such similar users. This is done through some similarity function that evaluates how similar two users are based on their purchased items and the ratings upon these items. Cosine similarity and Pearson correlation are popular choices of similarity functions.
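As a sketch of the first option, the cosine similarity between two users u and u' can be computed over their rating vectors (in practice, often restricted to the items both have rated), where r_{ui} denotes the rating of user u for item i:

sim(u,u') = \frac{\sum_{i} r_{ui} \, r_{u'i}}{\sqrt{\sum_{i} r_{ui}^2} \, \sqrt{\sum_{i} r_{u'i}^2}}

For non-negative ratings it takes values in [0,1], with higher values indicating more similar users.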

Once the users similar to the query user are identified, for each item i unrated by the query user a prediction is made using the corresponding ratings of the similar users that have rated the specific item. The user similarity is used as a weight upon the rating, so that more similar users contribute more than less similar ones. That is, for each similar user u' of u, the rating of u' on item i, denoted by r_{u'i}, is multiplied by their user similarity, sim(u,u'). Such scores are summed up over all similar users of u, and the result is normalized by the sum of user similarities.
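In formula form (a common formulation, where N(u,i) denotes the similar users of u that have rated item i):

pred(u,i) = \frac{\sum_{u' \in N(u,i)} sim(u,u') \, r_{u'i}}{\sum_{u' \in N(u,i)} sim(u,u')}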

The final list of predicted item ratings for u is ranked and the top ranked items are served as recommendations to the query user.

If you want to dive deeper into the topic, here are a couple of useful links:


Sunday, November 3, 2013

2. Data Preprocessing

Before applying any mining algorithm, one should prepare the data so as to be suitable for the algorithm. This involves transforming the data into a format suitable for the algorithm and also dealing with problems in the data, like noise and outliers.
In many applications the raw data are problematic, that is, their quality is low. There are different reasons for this low quality, e.g.:
  • Noisy data: e.g. an age value of 1000 is not valid. 
  • Incomplete data: missing values, or even entire attributes that are important for the analysis being missing 
  • Inconsistent data: e.g. different movie ratings between different databases; one might use the 1-5 range and the other the 1-10 range. If we want to analyze data from both systems, we should find some kind of mapping between the two ranges (a small example follows this list).
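For instance, assuming both scales are linear (a simple assumption for illustration), a rating r in the 1-5 range can be mapped to the 1-10 range via r' = 1 + (r - 1) \cdot \frac{10 - 1}{5 - 1}, so that 1 maps to 1 and 5 maps to 10.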
It is obvious that if we rely upon "dirty data", the mining results will be of low quality. The better the quality of the raw data, the better the quality of the mining results. 
So, preprocessing is a very important step in the mining process, and it usually takes a lot of time, depending of course on the application itself and on the quality of the original/raw data.

Major tasks in data preprocessing

  • Data cleaning:  e.g., fill in missing values, remove outliers
  • Data integration: "combine" data from different sources. Issues to be solved: entity resolution (which feature of one source maps to which feature of the other source), value resolution (like the different ranges problem described above).
  • Data transformation: e.g. normalize the data to a given range (usually 0-1; see the sketch after this list), generalize (e.g. from the x,y coordinates level to the city level).
  • Data reduction: aggregation (e.g. instead of 12 values for the salary attribute (1 per month) of a person, just keep the avg salary per month), dimensionality reduction (remove redundant dimensions, e.g. no need to use both birthday and age for the analysis), duplicate elimination. 
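A minimal, hypothetical Java sketch of min-max normalization to the 0-1 range (the salary values are made up):

    import java.util.Arrays;

    public class MinMaxDemo {
        public static void main(String[] args) {
            double[] salaries = {1200, 3400, 2100, 5000};        // hypothetical raw values
            double min = Arrays.stream(salaries).min().getAsDouble();
            double max = Arrays.stream(salaries).max().getAsDouble();
            double[] normalized = Arrays.stream(salaries)
                    .map(v -> (v - min) / (max - min))           // 1200 -> 0.0, 5000 -> 1.0
                    .toArray();
            System.out.println(Arrays.toString(normalized));     // [0.0, 0.578..., 0.236..., 1.0]
        }
    }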

Features

Although there are different types of data (numerical, text, images, videos, graphs, ...), the analysis relies upon features extracted from these data. So, the mining methods are, in a sense, generic, since they are applied over features extracted from different application domains.
You can think of the features as the columns/fields of a table in a database. So, the features are the properties/characteristics of the data.

Depending on the application, feature extraction might be a challenging step itself. For example, there is a lot of work on extracting features from text data (e.g., through TFIDF) or from images (e.g. color histograms).
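For reference, one common TFIDF variant (an assumption here, since there are several weighting schemes) is tfidf(t,d) = tf(t,d) \cdot \log\frac{N}{df(t)}, where tf(t,d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.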

There are different types of features/attributes:
  • binary, e.g., smoker (yes/no), glasses(yes/no),  gender(male/female)
  • categorical or nominal, e.g., eye color (brown, blue, green), occupation (engineer, student, teacher,...) 
  • ordinal, e.g. medal(bronze, silver, gold)
  • numerical, e.g. age, income, height
Numerical data are the most common. The numerical values might be the original values (e.g. x,y coordinates in a spatial application) or the result of some transformation (e.g. TFIDF values in text data).

Univariate Feature/Attribute descriptors
There are different measures to "characterize" single attributes. The reason we use these descriptors is to understand the distribution of the attribute values and clean the data by e.g. removing outliers.
For numerical features, the most common ones are:
mean/avg, median and mode (measures of the central tendency of the attribute), and range, quantiles, standard deviation and variance (measures of the dispersion of the attribute).
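For a numerical attribute with values x_1, ..., x_n, the two most basic ones are the mean \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i and the (population) variance s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2, with the standard deviation being s = \sqrt{s^2} (the sample variance divides by n-1 instead of n).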

Bivariate Feature/Attribute descriptors
There are different measures to "characterize" the correlation between two attributes. The reason we use these descriptors is to find possibly redundant attributes and avoid considering both of them in the analysis.
For numerical data, the most common measure is the (Pearson) correlation coefficient.
For categorical data, the χ² (chi-square) test is typically used.
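For reference, the correlation coefficient of two numerical attributes x and y over n objects is r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}; it ranges in [-1, 1], and values close to 1 or -1 indicate strongly (positively or negatively) correlated, and thus possibly redundant, attributes.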

1. Knowledge Discovery in Databases (KDD) and Data Mining (DM)

More and more data of different types (text, audio, images, videos,...) are collected nowadays from different data sources (telecommunication, science, business, health-care systems, WWW,...).

Due to their quantity and complexity, it is impossible for humans to exploit these data collections through some manual process. Here comes the role of Knowledge Discovery in Databases (KDD), which aims at discovering knowledge hidden in vast amounts of data.

The KDD process consists of the following steps (see the picture below):
  1. Selection of data which are relevant to the analysis task
  2. Preprocessing of these data, including tasks like data cleaning and data integration
  3. Transformation of the data into forms appropriate for mining
  4. Application of Data Mining algorithms for the extraction of patterns
  5. Interpretation/evaluation of the generated patterns so as to identify those patterns that represent real knowledge, based on some interestingness measures.


The Data Mining (DM) step is one of the core steps of the KDD process. Its goal is to apply data analysis and knowledge discovery algorithms that produce a particular enumeration of patterns (or models) over the data.

The KDD process was introduced in the following paper:

Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth (1996), "From Data Mining to Knowledge Discovery: An Overview," in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.

What do we mean by the term patterns or data mining models?
One can think of patterns as comprehensive summaries of the data or as a higher-level description of the data. Different types of patterns exist, like clusters, decision trees, association rules, frequent itemsets and sequential patterns.

There are a lot of informative resources on DM and KDD that one can use for further reading. I list some of them below:

Sunday, November 25, 2012

Stored procedures in MySQL


Some links with useful examples for writing stored procedures:

Integer division in Java


In Java, the division of two integers is again an integer: the fractional part is truncated.
E.g.    
    int a=5;
    int b=10;
    System.out.println(a/b);
It will output 0.

One could use double instead. 
E.g.
    double a=5;
    double b=10;
    System.out.println(a/b);
It will output 0.5.
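Alternatively, if the variables have to stay int, casting one operand to double is enough (a small sketch):

    int a = 5;
    int b = 10;
    System.out.println((double) a / b);   // the cast promotes the division to double: prints 0.5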

More here: http://stackoverflow.com/questions/8110019/java-divison-with-two-integer-operands-not-working

Thursday, August 23, 2012

Free online data mining books

For the interested reader, there are several data mining books available online for free. I list some of them below:
The emphasis in the book by Rajaraman, Leskovec and Ullman is on mining very large amounts of data, i.e., data that do not fit in main memory. The WWW is a source of such data and is used quite often as an example in this book.

Below is the table of contents:
  1. Data Mining
  2. Large-Scale File Systems and Map-Reduce
  3. Finding Similar Items
  4. Mining Data Streams
  5. Link Analysis
  6. Frequent Itemsets
  7. Clustering
  8. Advertising on the Web
  9. Recommendation Systems
  10. Mining Social-Network Graphs

Monday, August 6, 2012

Distribution skewness

Skewness

A distribution is skewed if one of its tails is longer than the other.

  • Negative skewness: the distribution has a long tail in the negative direction. 
  • Symmetric: no skewness. 
  • Positive skewness: the distribution has a long tail in the positive direction.
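For reference, skewness is commonly quantified by the third standardized moment, e.g. the sample skewness g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}, where \bar{x} is the mean and s the standard deviation; negative values correspond to negative skewness, values around 0 to a (roughly) symmetric distribution, and positive values to positive skewness.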
Examples are presented below.




Bisecting k-Means


Bisecting k-Means is like a combination of k-Means and hierarchical clustering.
It starts with all objects in a single cluster.

The pseudocode of the algorithm is displayed below:

Basic Bisecting K-means Algorithm for finding K clusters
  1. Pick a cluster to split.
  2. Find 2 sub-clusters using the basic k-Means algorithm (Bisecting step)
  3. Repeat step 2, the bisecting step, for ITER times and take the split that produces the clustering with the highest overall similarity.
  4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.

The critical part is which cluster to choose for splitting. There are different ways to proceed; for example, you can choose the biggest cluster, the cluster with the worst quality, or a combination of both.
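A minimal, hypothetical Java sketch of the idea, on 1-D points and with the "worst quality" criterion implemented as the largest sum of squared errors (SSE); it is only an illustration, not the exact algorithm from the paper below:

    import java.util.*;

    // Illustrative bisecting k-Means on 1-D points (not an optimized implementation).
    public class BisectingKMeansSketch {

        // Basic 2-means (k=2) on one cluster; returns the two sub-clusters.
        static List<List<Double>> bisect(List<Double> points, Random rnd) {
            double c1 = points.get(rnd.nextInt(points.size()));
            double c2 = points.get(rnd.nextInt(points.size()));
            List<Double> a = new ArrayList<>(), b = new ArrayList<>();
            for (int iter = 0; iter < 20; iter++) {          // a fixed number of k-Means iterations
                a.clear(); b.clear();
                for (double p : points) {
                    if (Math.abs(p - c1) <= Math.abs(p - c2)) a.add(p); else b.add(p);
                }
                c1 = mean(a, c1);
                c2 = mean(b, c2);
            }
            return Arrays.asList(a, b);
        }

        static double mean(List<Double> xs, double fallback) {
            if (xs.isEmpty()) return fallback;
            double s = 0; for (double x : xs) s += x;
            return s / xs.size();
        }

        // Sum of squared errors of a cluster around its mean; used as the quality measure.
        static double sse(List<Double> xs) {
            double m = mean(xs, 0), s = 0;
            for (double x : xs) s += (x - m) * (x - m);
            return s;
        }

        static List<List<Double>> cluster(List<Double> points, int k, int trials) {
            Random rnd = new Random(42);
            List<List<Double>> clusters = new ArrayList<>();
            clusters.add(new ArrayList<>(points));           // start: all objects in a single cluster
            while (clusters.size() < k) {
                // 1. pick the cluster to split: here, the one with the largest SSE (worst quality)
                List<Double> worst = Collections.max(clusters,
                        Comparator.comparingDouble(BisectingKMeansSketch::sse));
                clusters.remove(worst);
                // 2.-3. bisect it several times and keep the split with the lowest total SSE
                List<List<Double>> best = null;
                double bestScore = Double.MAX_VALUE;
                for (int t = 0; t < trials; t++) {
                    List<List<Double>> split = bisect(worst, rnd);
                    double score = sse(split.get(0)) + sse(split.get(1));
                    if (score < bestScore) { bestScore = score; best = split; }
                }
                clusters.addAll(best);                        // 4. repeat until k clusters are reached
            }
            return clusters;
        }

        public static void main(String[] args) {
            List<Double> pts = Arrays.asList(1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.0, 9.2);
            System.out.println(cluster(pts, 3, 5));           // three groups, roughly around 1, 5 and 9
        }
    }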

Source: "A comparison of document clustering techniques", M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000. [pdf]




Sunday, April 29, 2012

1. Introduction to KDD and DM

The first lecture on Knowledge Discovery in Databases (KDD) is about motivation and introducing the different concepts: 
  1. Why KDD?
  2. Application examples
  3. What is KDD?
  4. What is Data Mining (DM)?
  5. Supervised vs unsupervised learning
  6. Main DM tasks

Question 1 is a very easy one. There are so many data nowadays (quantity) and in so many different forms, like text, audio, images, video, etc. (complexity), that it is impossible to exploit them manually. So, methods are required which help us to extract "knowledge" from such kinds of data in a (semi-)automatic way. Here comes the role of Knowledge Discovery in Databases (KDD) and Data Mining (DM).


The retail industry, telecommunications, health-care systems, the WWW, science and business are some of the areas where KDD is applied (Question 2).

KDD is a process (Question 3), that is, it consists of several steps (see the picture below):
  1. Selection of data which are relevant to the analysis task
  2. Preprocessing of these data, including tasks like data cleaning and data integration
  3. Transformation of the data into forms appropriate for mining
  4. Application of Data Mining algorithms for the extraction of patterns
  5. Interpretation/evaluation of the generated patterns so as to identify those patterns that represent real knowledge, based on some interestingness measures.


If, at some step, you notice some error or "strange" results, you can go back to a previous step in the process and revise it (hence the feedback loops in the picture).

The KDD process was introduced in the following paper:

Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth (1996), "From Data Mining to Knowledge Discovery: An Overview," in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.

The Data Mining (DM) step (Question 4) is one of the core steps of the KDD process. Its goal is to apply data analysis and knowledge discovery algorithms that produce a particular enumeration of patterns (or models) over the data.
Although DM is the most "famous" step, and sometimes the whole procedure is called DM instead of KDD, it heavily relies on the other steps. I mean, if your preprocessing "sucks", the mining results will also "suck".

The results of this step are the patterns or data mining models.
One can think of patterns as comprehensive summaries of the data or as a higher-level description of the data. Different types of patterns exist, like clusters, decision trees, association rules, frequent itemsets and sequential patterns.

(Question 5) There are two ways of learning from data: the supervised way and the unsupervised way.
  • Supervised learning: 
we are given some examples/instances of the problem for which the true classes/labels are known.
For example, in a medical application we have a set of people for whom we know whether they have some specific disease (class: yes) or not (class: no). The goal is to build a model for each of the different classes, so that, for a new instance whose class is unknown, the model can predict its class (a tiny illustrative sketch follows this list).
In our example, for a new person the model should decide whether he/she suffers from the specific disease (class: yes) or not (class: no).
  • Unsupervised learning: 
there is no a priori knowledge about the right answer. Based on the instance characteristics, the instances are organized into groups of similar instances.
For example, in a news aggregation site like Google News, the news posts are organized into groups of posts referring to the same topic. To do so, the similarity between the news posts is employed.
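A deliberately tiny, hypothetical Java sketch of the supervised setting above, using a 1-nearest-neighbour rule on a single numerical feature (just an illustration of "predict the class of a new instance from labeled examples", not a method discussed in this post):

    public class OneNNSketch {
        public static void main(String[] args) {
            double[] feature = {37.2, 39.5, 36.8, 40.1};           // hypothetical measurements of known patients
            String[] label   = {"no", "yes", "no", "yes"};         // known classes: disease yes/no

            double newPatient = 38.9;                              // new instance with unknown class
            int best = 0;
            for (int i = 1; i < feature.length; i++) {
                if (Math.abs(feature[i] - newPatient) < Math.abs(feature[best] - newPatient)) best = i;
            }
            System.out.println("predicted class: " + label[best]); // nearest labeled example (39.5) -> "yes"
        }
    }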

(Question 6) The main DM tasks are:
  • Clustering
  • Classification
  • Association rules mining
  • Regression
  • Outlier detection
We will explain each of them in detail in later posts. 

Hello again

My first hello post was back in 2009. 
At that time, I wanted to blog about data mining related issues but for some reason I lost my motivation quite fast :-(
I decided to restart now. I am giving a data mining lecture this semester, so I think it's a good motivation to restart.



Thursday, June 30, 2011

CERN experiments generating one petabyte of data every second

Wow... The amount of data generated by the experiments at CERN is HUGE!!!! 1 PB/sec!!!!

Of course they do not save all these data but only the interesting measurements (based on filters), but again the amount is incredible.
"CERN is storing 25PB of data every year – the same as 1,000 years' worth of DVD quality video..."
These data are used for further analysis. Just to recall, the goal of CERN is to answer fundamental questions like what makes our Universe work, where it came from and where it is going...

Check the article by Dan Worth at CERN here.
Some cool photos from the experimental area here
And some photos from various events during collisions here