Data Mining stories: August 2012

Thursday, August 23, 2012

Free online data mining books

For interested readers, several data mining books are available online for free. I list some of them below:

Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber
Data Mining: Introductory and Advanced Topics by Margaret Dunham
Data Mining for the Corporate Masses, by Neal Leavitt

Mining of Massive Datasets, by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman

The book by Rajaraman, Leskovec and Ullman focuses on mining very large amounts of data, i.e., data that do not fit in main memory. The Web is a source of such data, and the book uses it often as an example.

Below is the table of contents:

Data Mining
Large-Scale File Systems and Map-Reduce
Finding Similar Items
Mining Data Streams
Link Analysis
Frequent Itemsets
Clustering
Advertising on the Web
Recommendation Systems
Mining Social-Network Graphs

Saturday, August 11, 2012

Top conferences in Data Mining

The top-10 Data Mining conferences according to Microsoft Academic Search are:

For an updated list, visit http://academic.research.microsoft.com/RankList?entitytype=3&topDomainID=2&subDomainID=7

Monday, August 6, 2012

Distribution skewness

Skewness

A distribution is skewed if one of its tails is longer than the other.

Negative skewness: the distribution has a long tail in the negative direction.
Symmetric: no skewness.
Positive skewness: the distribution has a long tail in the positive direction.

Examples are presented below.
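As a numeric illustration, the sign of the skewness can be checked with the moment coefficient g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments of the sample. This is a minimal sketch: the function name and the three small datasets are made up for illustration, and g1 is only one of several skewness estimators in use.

```python
from statistics import fmean


def sample_skewness(values):
    """Moment coefficient of skewness: g1 = m3 / m2**1.5."""
    n = len(values)
    mean = fmean(values)
    # Second and third central moments (dividing by n, not n - 1).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical toy datasets, one per case.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]      # long tail in the positive direction
symmetric = [1, 2, 3, 4, 5]                   # mirror image around its mean
left_tailed = [-x for x in right_tailed]      # long tail in the negative direction

for name, data in [("right-tailed", right_tailed),
                   ("symmetric", symmetric),
                   ("left-tailed", left_tailed)]:
    print(name, sample_skewness(data))
```

The right-tailed sample gives a positive value, the symmetric one gives zero, and the mirrored sample gives a negative value of the same magnitude.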

Bisecting k-Means

Bisecting k-Means combines k-Means with top-down (divisive) hierarchical clustering.

It starts with all objects in a single cluster and repeatedly splits one cluster in two.

The pseudocode of the algorithm is displayed below:

Basic Bisecting K-means Algorithm for finding K clusters

1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic k-Means algorithm (bisecting step).
3. Repeat step 2, the bisecting step, ITER times and take the split that produces the clustering with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.

The critical part is choosing which cluster to split. There are different ways to do this: for example, you can choose the biggest cluster, the cluster with the worst quality, or a combination of both.
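The steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation from the paper. It assumes that the points are distinct numeric tuples, that the cluster to split is always the biggest one, and that "highest overall similarity" can be approximated by the lowest sum of squared distances to the centroids (SSE). All function and parameter names are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sse(points):
    """Sum of squared distances to the centroid (lower = tighter cluster)."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, rounds=20):
    """Basic k-Means with k = 2; assumes at least two distinct points."""
    centers = rng.sample(points, 2)
    for _ in range(rounds):
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        new_centers = [centroid(left), centroid(right)]
        if new_centers == centers:  # assignments no longer change
            break
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k, iters=5, seed=0):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    clusters = [list(points)]  # start with everything in one cluster
    while len(clusters) < k:
        # Step 1: pick the cluster to split (assumption: the biggest one).
        biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        target = clusters.pop(biggest)
        # Steps 2-3: bisect `iters` times and keep the tightest split.
        trials = [two_means(target, rng) for _ in range(iters)]
        best = min(trials, key=lambda halves: sse(halves[0]) + sse(halves[1]))
        clusters.extend(best)
    return clusters


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for cluster in bisecting_kmeans(toy_points, k=2):
    print(sorted(cluster))
```

Because the clusters that are not being split stay the same across trials, comparing only the SSE of the two new halves ranks the candidate splits in the same order as comparing the whole clustering.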

Source: "A comparison of document clustering techniques", M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.