Thursday, August 23, 2012

Free online data mining books

For the interested reader, several data mining books are available online for free. I list some of them below:
The book by Rajaraman, Leskovec and Ullman emphasizes mining very large amounts of data, i.e., data that do not fit in main memory. The Web is a source of such data and is used quite often as an example in the book.

Below is the table of contents:
  1. Data Mining
  2. Large-Scale File Systems and Map-Reduce
  3. Finding Similar Items
  4. Mining Data Streams
  5. Link Analysis
  6. Frequent Itemsets
  7. Clustering
  8. Advertising on the Web
  9. Recommendation Systems
  10. Mining Social-Network Graphs

Monday, August 6, 2012

Distribution skewness

Skewness

A distribution is skewed if one of its tails is longer than the other.

  • Negative skewness: the distribution has a long tail in the negative direction. 
  • Symmetric: no skewness. 
  • Positive skewness: the distribution has a long tail in the positive direction.
Examples are presented below.
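
To make the three cases concrete, here is a small sketch in Python that computes the sample skewness of three toy data sets. The moment-based formula (third central moment divided by the cube of the standard deviation) is one common definition and is my choice here, not something prescribed above; the data sets and the function name `skewness` are made up for illustration.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment divided by sigma cubed."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    n = len(values)
    return sum((x - mean) ** 3 for x in values) / (n * sigma ** 3)

# Toy data, invented for this example.
left_tailed = [1, 7, 8, 8, 9, 9, 9, 10, 10, 10]   # long tail toward small values
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]           # mirror image around 5
right_tailed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 10]    # long tail toward large values

for name, data in [("left-tailed", left_tailed),
                   ("symmetric", symmetric),
                   ("right-tailed", right_tailed)]:
    print(name, round(skewness(data), 3))
```

The first data set gives a negative value, the second gives zero and the third gives a positive value, matching the three cases in the list.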




Bisecting k-Means


Bisecting k-Means is like a combination of k-Means and hierarchical clustering.
It starts with all objects in a single cluster.

The pseudocode of the algorithm is displayed below:

Basic Bisecting K-means Algorithm for finding K clusters
  1. Pick a cluster to split.
  2. Find 2 sub-clusters using the basic k-Means algorithm (Bisecting step)
  3. Repeat step 2, the bisecting step, for ITER times and take the split that produces the clustering with the highest overall similarity.
  4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
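
The four steps can be sketched in plain Python as follows. This is a minimal illustration and not the implementation from the paper: it assumes points are tuples of numbers with no duplicates, it always splits the largest cluster in step 1, and it uses the lowest sum of squared errors (SSE) as a stand-in for "highest overall similarity". All function names are my own.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def sse(points):
    center = centroid(points)
    return sum(sq_dist(p, center) for p in points)

def two_means(points, rng, max_rounds=20):
    """Basic k-Means with k = 2; returns the two halves as lists."""
    centers = rng.sample(points, 2)
    left, right = [], []
    for _ in range(max_rounds):
        left = [p for p in points if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break  # degenerate split; the caller discards it
        new_centers = [centroid(left), centroid(right)]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return left, right

def bisecting_kmeans(points, k, iters=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # start with all objects in a single cluster
    while len(clusters) < k:
        # Step 1: pick a cluster to split (assumption: the largest one).
        target = max(clusters, key=len)
        clusters.remove(target)
        # Steps 2 and 3: bisect ITER times and keep the best split.
        trials = [two_means(target, rng) for _ in range(iters)]
        trials = [t for t in trials if t[0] and t[1]]
        best = min(trials, key=lambda t: sse(t[0]) + sse(t[1]))
        clusters.extend(best)
    return clusters  # Step 4: the loop ends once k clusters exist

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0)]
for cluster in bisecting_kmeans(points, 3):
    print(len(cluster), centroid(cluster))
```

Because each bisection starts from randomly sampled centers, individual trials can end in a poor local optimum; repeating the split `iters` times and keeping the best one is what step 3 is for.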

The critical part is choosing which cluster to split. There are different ways to proceed: for example, you can choose the biggest cluster, the cluster with the worst quality, or a combination of both.
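
As a rough illustration of these options, the sketch below implements three selection rules in Python. The quality measures are my assumptions: "worst quality" is taken to mean the highest mean squared distance to the centroid, and the "combination" is the total SSE, which equals cluster size times that mean. The function names and the toy clusters are hypothetical.

```python
def centroid(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    center = centroid(points)
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def pick_largest(clusters):
    # Index of the cluster with the most points.
    return max(range(len(clusters)), key=lambda i: len(clusters[i]))

def pick_worst_quality(clusters):
    # Index of the cluster with the highest mean squared distance to its centroid.
    return max(range(len(clusters)), key=lambda i: sse(clusters[i]) / len(clusters[i]))

def pick_highest_sse(clusters):
    # Total SSE = size * mean squared distance, so it weighs both criteria.
    return max(range(len(clusters)), key=lambda i: sse(clusters[i]))

# Toy one-dimensional clusters, invented for this example.
clusters = [
    [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,)],  # big but tight
    [(10.0,), (16.0,)],                                # small but very loose
    [(20.0,), (22.0,), (24.0,), (26.0,)],              # medium size, medium spread
]
print(pick_largest(clusters), pick_worst_quality(clusters), pick_highest_sse(clusters))
```

On this toy input the three rules disagree: the first picks cluster 0, the second picks cluster 1 and the third picks cluster 2, which shows that the choice of rule can change the resulting hierarchy.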

Source: "A comparison of document clustering techniques", M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000. [pdf]