Data Mining stories: 2012

Sunday, November 25, 2012

Stored procedures in MySQL

Some links with useful examples for writing stored procedures:

"Simple Example of a MySQL Stored Procedure that uses a cursor", a nice article explaining how to use a cursor and how to debug a stored procedure in MySQL

In Java, dividing two integers yields an integer: the fractional part is discarded.
E.g.
int a=5;
int b=10;
System.out.println(a/b);
It will output 0.

One could use double instead.
E.g.
double a=5;
double b=10;
System.out.println(a/b);
It will output 0.5.
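
If the operands have to stay int, casting one of them to double before dividing has the same effect. A minimal sketch; the class and method names are made up for illustration:

```java
class DivisionDemo {
    // Integer division: the fractional part is discarded (truncation toward zero).
    static int intDivide(int a, int b) {
        return a / b;
    }

    // Casting one operand to double forces floating-point division.
    static double realDivide(int a, int b) {
        return (double) a / b;
    }

    public static void main(String[] args) {
        System.out.println(intDivide(5, 10));   // 0
        System.out.println(realDivide(5, 10));  // 0.5
    }
}
```

Note that the cast has to apply to an operand, not to the result: (double) (a / b) would still perform the integer division first.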

More here: http://stackoverflow.com/questions/8110019/java-divison-with-two-integer-operands-not-working

Thursday, August 23, 2012

Free online data mining books

For the interested reader, there are several data mining books available online for free. I list some of them below:

Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber
Data Mining: Introductory and Advanced Topics by Margaret Dunham
Data Mining for the Corporate Masses, by Neal Leavitt

Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman.

The emphasis in the book by Rajaraman, Leskovec and Ullman is on mining very large amounts of data, i.e., data that do not fit in main memory. The Web is a source of such data and is used quite often as an example in this book.

Below is the table of contents:

Data Mining
Large-Scale File Systems and Map-Reduce
Finding Similar Items
Mining Data Streams
Link Analysis
Frequent Itemsets
Clustering
Advertising on the Web
Recommendation Systems
Mining Social-Network Graphs

Saturday, August 11, 2012

Top conferences in Data Mining

The top-10 Data Mining conferences according to Microsoft Academic Search are:

For an updated list, visit http://academic.research.microsoft.com/RankList?entitytype=3&topDomainID=2&subDomainID=7

Monday, August 6, 2012

Distribution skewness

Skewness

A distribution is skewed if one of its tails is longer than the other.

Negative skewness: the distribution has a long tail in the negative direction.
Symmetric: no skewness.
Positive skewness: the distribution has a long tail in the positive direction.
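
The sign of the skewness can also be checked numerically. A minimal sketch, assuming the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 3/2) as the measure; the post itself does not name a formula, and the class and method names are only illustrative:

```java
class SkewnessDemo {
    // Moment coefficient of skewness: m3 / m2^(3/2), where m2 and m3 are the
    // second and third central moments. Returns NaN if all values are identical.
    static double skewness(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;

        double m2 = 0, m3 = 0;
        for (double v : x) {
            double d = v - mean;
            m2 += d * d;
            m3 += d * d * d;
        }
        m2 /= x.length;
        m3 /= x.length;
        return m3 / Math.pow(m2, 1.5);
    }

    public static void main(String[] args) {
        // Long tail to the left: the result is negative.
        System.out.println(skewness(new double[]{-10, -2, -1, -1, -1}));
        // Symmetric data: the result is zero.
        System.out.println(skewness(new double[]{1, 2, 3, 4, 5}));
        // Long tail to the right: the result is positive.
        System.out.println(skewness(new double[]{1, 1, 1, 2, 10}));
    }
}
```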

Examples are presented below.

Bisecting k-Means

Bisecting k-Means is like a combination of k-Means and hierarchical clustering.

It starts with all objects in a single cluster.

The pseudocode of the algorithm is displayed below:

Basic Bisecting K-means Algorithm for finding K clusters

1. Pick a cluster to split.

2. Find 2 sub-clusters using the basic k-Means algorithm (bisecting step).

3. Repeat step 2, the bisecting step, for ITER times and take the split that produces the clustering with the highest overall similarity.

4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.

The critical part is which cluster to choose for splitting. There are different ways to proceed: for example, you can choose the biggest cluster, the cluster with the worst quality, or a combination of both.
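
The steps above can be sketched in Java roughly as follows. This is only an illustration under a few assumptions of mine: points are numeric vectors, the cluster to split is always the biggest one, and "highest overall similarity" is approximated by the lowest sum of squared distances to the cluster centroids (Steinbach et al. work with documents and cosine similarity). The class and method names are made up, and the points of a cluster to be split are assumed not to be all identical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class BisectingKMeans {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    static double[] centroid(List<double[]> pts) {
        double[] c = new double[pts.get(0).length];
        for (double[] p : pts)
            for (int i = 0; i < c.length; i++) c[i] += p[i] / pts.size();
        return c;
    }

    // Sum of squared distances to the centroid (lower means a tighter cluster).
    static double sse(List<double[]> pts) {
        double[] c = centroid(pts);
        double s = 0;
        for (double[] p : pts) s += dist2(p, c);
        return s;
    }

    // Bisecting step: basic 2-means, seeded with two distinct random points.
    static List<List<double[]>> twoMeans(List<double[]> pts, Random rnd) {
        int i = rnd.nextInt(pts.size());
        int j = rnd.nextInt(pts.size() - 1);
        if (j >= i) j++; // makes the second seed index different from the first
        double[] c0 = pts.get(i), c1 = pts.get(j);
        List<double[]> a = new ArrayList<>(), b = new ArrayList<>();
        for (int round = 0; round < 20; round++) { // fixed number of refinement rounds
            a = new ArrayList<>();
            b = new ArrayList<>();
            for (double[] p : pts) {
                if (dist2(p, c0) <= dist2(p, c1)) a.add(p); else b.add(p);
            }
            if (a.isEmpty() || b.isEmpty()) break; // degenerate split, caller skips it
            c0 = centroid(a);
            c1 = centroid(b);
        }
        return List.of(a, b);
    }

    static List<List<double[]>> cluster(List<double[]> pts, int k, int iter, long seed) {
        if (k < 1 || k > pts.size()) throw new IllegalArgumentException("need 1 <= k <= n");
        Random rnd = new Random(seed);
        List<List<double[]>> clusters = new ArrayList<>();
        clusters.add(pts); // start with all objects in a single cluster
        while (clusters.size() < k) {
            // Step 1: pick the cluster to split (assumption: the biggest one).
            int target = 0;
            for (int c = 1; c < clusters.size(); c++)
                if (clusters.get(c).size() > clusters.get(target).size()) target = c;
            // Steps 2 and 3: bisect ITER times and keep the split with the lowest SSE.
            List<List<double[]>> best = null;
            double bestScore = Double.POSITIVE_INFINITY;
            for (int t = 0; t < iter; t++) {
                List<List<double[]>> split = twoMeans(clusters.get(target), rnd);
                if (split.get(0).isEmpty() || split.get(1).isEmpty()) continue;
                double score = sse(split.get(0)) + sse(split.get(1));
                if (score < bestScore) { bestScore = score; best = split; }
            }
            if (best == null) throw new IllegalStateException("no valid split found");
            clusters.remove(target); // replace the chosen cluster by its two halves
            clusters.addAll(best);
        } // Step 4: repeat until k clusters exist
        return clusters;
    }

    public static void main(String[] args) {
        double[][] raw = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10},
                          {20, 0}, {20, 1}, {21, 0}};
        List<double[]> pts = new ArrayList<>();
        for (double[] p : raw) pts.add(p);
        for (List<double[]> c : cluster(pts, 3, 5, 42L))
            System.out.println("cluster of size " + c.size());
    }
}
```

Because the other clusters stay untouched during a split, comparing the SSE of the two new sub-clusters is enough to compare the resulting clusterings. Choosing the cluster with the worst quality instead of the biggest one would only change step 1.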

Source: "A comparison of document clustering techniques", M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000. [pdf]

Sunday, April 29, 2012

1. Introduction to KDD and DM

The first lecture on Knowledge Discovery in Databases (KDD) covers the motivation and introduces the main concepts:

1. Why KDD?
2. Application examples
3. What is KDD?
4. What is Data Mining (DM)?
5. Supervised vs unsupervised learning
6. Main DM tasks

Question 1 is a very easy one. There is so much data nowadays (quantity), in so many different forms such as text, audio, images and video (complexity), that it is impossible to exploit it manually. So we need methods that help us extract "knowledge" from such data in a (semi-)automatic way. This is where Knowledge Discovery in Databases (KDD) and Data Mining (DM) come in.

Retail, telecommunications, health care, the Web, science and business are some of the areas where KDD is applied (Question 2).

KDD is a process (Question 3), that is, it consists of several steps (see the picture below):

Selection of data which are relevant to the analysis task
Preprocessing of these data, including tasks like data cleaning and data integration
Transformation of the data into forms appropriate for mining
Application of Data Mining algorithms for the extraction of patterns
Interpretation/evaluation of the generated patterns so as to identify those patterns that represent real knowledge, based on some interestingness measures.

If, at some step, you notice an error or "strange" results, you can go back to a previous step in the process and revise it (hence the feedback loops in the picture).

The KDD process was introduced in the following paper:

Usama Fayyad, Gregory Piatetsky-Shapiro and Padhraic Smyth (1996), “From Data Mining to Knowledge Discovery: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

The Data Mining (DM) step (Question 4) is one of the core steps of the KDD process. Its goal is to apply data analysis and knowledge discovery algorithms that produce a particular enumeration of patterns (or models) over the data.

Although DM is the most "famous" step, and sometimes the whole process is called DM instead of KDD, it relies heavily on the other steps. I mean, if your preprocessing "sucks", the mining results will also "suck".

The results of this step are the patterns or data mining models:

One can think of patterns as comprehensive summaries of the data or as higher-level descriptions of the data. Different types of patterns exist, such as clusters, decision trees, association rules, frequent itemsets and sequential patterns.

(Question 5) There are two ways of learning from data: the supervised way and the unsupervised way.

Supervised learning:

some examples/instances of the problem are given, together with their true classes/labels.

For example, in a medical application you have a set of people for whom you know whether they have some specific disease (class: yes) or not (class: no). The goal is to build a model of the different classes so that, for any new instance whose class is unknown, you can predict its class.

In our example, for a new person the model should decide whether he/she suffers from the specific disease (class: yes) or not (class: no).
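
To make this concrete, here is a minimal sketch of a supervised classifier. It uses 1-nearest-neighbour, which is my choice for brevity and not something the lecture prescribes; strictly speaking it keeps the labelled examples instead of building an explicit model per class. The features and their values are invented for illustration.

```java
class NearestNeighborDemo {
    // Predicts the label of `query` as the label of its closest training
    // instance (1-nearest-neighbour with squared Euclidean distance).
    static String predict(double[][] features, String[] labels, double[] query) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < features.length; i++) {
            double d = 0;
            for (int j = 0; j < query.length; j++) {
                double diff = features[i][j] - query[j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        // Hypothetical measurements per person: {body temperature, marker level}
        double[][] patients = {{36.6, 1.0}, {36.8, 1.2}, {39.1, 4.5}, {38.7, 5.0}};
        String[] disease = {"no", "no", "yes", "yes"};

        // New persons whose class is unknown
        System.out.println(predict(patients, disease, new double[]{38.9, 4.8})); // yes
        System.out.println(predict(patients, disease, new double[]{36.7, 0.9})); // no
    }
}
```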

Unsupervised learning:

there is no a priori knowledge about the right answer for the instances. Based on the instance characteristics, the instances are organized into groups of similar instances.

For example, in a news aggregation site like Google News, the news posts are organized into groups of posts referring to the same topic. To do so, the similarity between the news posts is employed.
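
A toy sketch of this idea: posts are compared by the Jaccard similarity of their word sets and grouped greedily. Both the similarity measure and the grouping rule are assumptions of mine chosen for brevity (real aggregators are far more sophisticated), and the headlines are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NewsGroupingDemo {
    static Set<String> words(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Greedy grouping: a post joins the first group whose first post is
    // similar enough, otherwise it starts a new group. No labels are used.
    static List<List<String>> group(List<String> posts, double threshold) {
        List<List<String>> groups = new ArrayList<>();
        for (String post : posts) {
            List<String> home = null;
            for (List<String> g : groups) {
                if (jaccard(words(post), words(g.get(0))) >= threshold) {
                    home = g;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                groups.add(home);
            }
            home.add(post);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> posts = Arrays.asList(
                "central bank raises interest rates",
                "team wins championship final",
                "bank raises interest rates again",
                "championship final won by home team");
        // Prints two groups: the two finance posts and the two sports posts.
        for (List<String> g : group(posts, 0.3)) System.out.println(g);
    }
}
```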

(Question 6) The main DM tasks are:

Clustering
Classification
Association rule mining
Regression
Outlier detection

We will explain each of them in detail in later posts.

Hello again

My first hello post was back in 2009.
At that time, I wanted to blog about data mining related issues, but for some reason I lost my motivation quite fast :-(
I decided to restart now. I am giving a data mining lecture this semester, so I think it's a good motivation to restart.
