+Statistics

Statistics is the study of the collection, organization, and interpretation of data.

Statistical inference

Also known as inductive statistics or inferential statistics.

Inference from data subject to randomness

This is the process of drawing conclusions from data that are subject to random variation, such as observational error or sampling variation. Probability theory provides a way of handling this uncertainty.
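As a minimal sketch of inference under sampling variation, the snippet below estimates a mean from noisy observations and attaches an approximate 95% interval using the normal approximation. The true value, the noise level, the sample size, and the function name `mean_confidence_interval` are all assumptions made for illustration.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% interval.

    Uses the normal approximation mean +/- z * s / sqrt(n); z=1.96 is the
    usual choice for roughly 95% coverage when n is reasonably large.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_VALUE = 10.0       # hypothetical quantity being measured

# Each observation is the true value plus Gaussian observational error.
sample = [TRUE_VALUE + rng.gauss(0.0, 2.0) for _ in range(400)]
mean, lower, upper = mean_confidence_interval(sample)
print(f"estimate {mean:.3f}, interval ({lower:.3f}, {upper:.3f})")
```

The interval width shrinks as the sample grows, which is one way probability theory quantifies how much of the estimate is due to random variation.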

See also the Inference survey for other scenarios, where random variation plays a smaller role.

Modeling problems

Some problems are closely tied to particular applications, for example many vision problems. Here, the art is often in picking the right model, that is, in choosing the proper abstraction. For example: is it better to use linear dimensionality reduction or manifold learning? Is it better to cast the task as a regression problem or as a classification problem? Often, simple models work surprisingly well. See also the comments in the ‘probabilistic models’ survey.

The decision about what you model and what you don’t is important. Often, prior work is advanced by designing a model that captures more aspects of the problem than previous models did.

Once the problem is modeled, one must find a suitable estimation algorithm.

See also comments

Solving modeled problems

This is algorithm design and analysis. See also the algorithms survey.

Here, you take already-modeled, well-specified problems and try to find upper and lower bounds on the resources (time, sample points, memory, randomness, etc.) required for various estimation tasks.
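One concrete instance of an upper bound on sample points comes from Hoeffding's inequality, which is a swapped-in example and not something these notes prescribe. For i.i.d. values in [0, 1], the sketch below computes a sample size sufficient for the sample mean to be within epsilon of the true mean with probability at least 1 - delta. The function name is hypothetical.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Samples sufficient so that |sample mean - true mean| < epsilon
    with probability at least 1 - delta, for i.i.d. values in [0, 1].

    Follows from Hoeffding's inequality:
        P(|mean_n - mu| >= epsilon) <= 2 * exp(-2 * n * epsilon**2)
    Setting the right-hand side to delta and solving for n gives the bound.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


print(hoeffding_sample_size(0.1, 0.05))   # 185
print(hoeffding_sample_size(0.05, 0.05))  # 738
```

Halving epsilon roughly quadruples the required sample size, since the bound scales as 1 / epsilon^2. This is only an upper bound; a matching lower bound requires a separate argument.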

Note that finding upper bounds on resources often involves designing new estimation algorithms to solve the problem. This is often done by modifying an algorithm for a closely related model.

Example: pattern recognition in a stream, where memory and processing time are very limited, as with sensors in bridges.
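To illustrate estimation under a tight memory budget, here is a sketch of Welford's online algorithm, which tracks the mean and variance of a stream in constant memory and one pass. It is chosen here as a simple stand-in for streaming estimation in general; the class name and the readings are made up.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 with fewer than 2 points."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up sensor readings
    stats.update(reading)
print(stats.count, stats.mean, stats.variance)
```

Each reading is processed once and then discarded, so memory use does not grow with the length of the stream.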

Degrees of abstractness

Domain specific models

Solutions tied to particular domains usually have short-term, shallow impact. Sometimes, however, their general modeling technique and estimation algorithm can be generalized to other domains, leading to greater impact. Also, the problem solved may itself be very important from a point of view outside data mining, such as that of astrophysics or biology.

Abstract models

Impact is often intangible: it improves practitioners’ intuition for and understanding of problems. Good solutions to important abstract problems tend to have long-term, deep impact.

Research effort

Read many papers. Critique the problems posed and the solutions presented. Alter the problems and argue why the altered versions are interesting. Criticize the solutions, and develop better solutions for the original problem.

Analysis of efficiency and complexity

For learning theory research, see the COLT reference. In statistical learning theory, a typical goal is to find the rate at which a parameter estimate converges.
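A convergence rate can also be checked empirically. The sketch below assumes the simplest possible setting, estimating the mean of Uniform(0, 1) draws with the sample mean, where the root-mean-square error is known to scale as 1 / sqrt(n). The function name, trial count, and seed are illustrative choices.

```python
import math
import random
import statistics


def rmse_of_sample_mean(n, trials, rng):
    """Monte Carlo estimate of the RMSE of the sample mean of n Uniform(0, 1) draws."""
    true_mean = 0.5
    squared_errors = []
    for _ in range(trials):
        estimate = statistics.fmean(rng.random() for _ in range(n))
        squared_errors.append((estimate - true_mean) ** 2)
    return math.sqrt(statistics.fmean(squared_errors))


rng = random.Random(1)  # fixed seed so the run is repeatable
rmse_100 = rmse_of_sample_mean(100, 2000, rng)
rmse_400 = rmse_of_sample_mean(400, 2000, rng)

# Under a 1 / sqrt(n) rate, quadrupling n should roughly halve the error.
print(rmse_100, rmse_400, rmse_100 / rmse_400)
```

The printed ratio should come out close to 2, up to Monte Carlo noise. Theory gives sqrt(1 / (12 n)) for this particular estimator, so the simulation only confirms what analysis already predicts.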

Modeling and experimentation

Purpose

Modeling the problem often requires exploration and experimentation.

Show that the techniques, even if mathematically well motivated, work well in practice. If they fail, in which cases do they fail?

Avoiding unnecessary work

One must approach the problem in a mathematically principled, disciplined manner. Doing so can avoid expending much unnecessary effort. One must take advantage of prior research and predetermined thought patterns.

Degree of experimentation

This depends heavily on the abstractness of the problem. If the problem is very specific to some domain, such as vision or social network analysis, a great deal of experimentation is needed. One has to check the goodness of one’s model against the ground truth.
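A minimal sketch of checking a model against ground truth: fit on one part of the data and measure the error on a held-out part. The data here are synthetic (a line plus noise), and the least-squares line fit merely stands in for whatever model a real domain would call for; all names and constants are assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between predictions and observed values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(2)  # fixed seed so the run is repeatable
# Synthetic "ground truth": y = 3x + 1 plus Gaussian noise (all values made up).
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [3.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

# Hold out the last 50 points; fit only on the first 150.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]
slope, intercept = fit_line(train_x, train_y)
test_mse = mean_squared_error(test_x, test_y, slope, intercept)
print(slope, intercept, test_mse)
```

A held-out error close to the noise variance (1.0 in this setup) suggests the model captures the structure; a much larger error would point to a modeling problem.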

Data preparation

This is relevant mainly in domain-specific research; it is usually absent when you try to answer more abstract questions.

Data preparation often involves a huge effort.

Following research

ICML, KDD, NIPS. Application conferences: CVPR, etc. Theory: arXiv, COLT.

Data mining conferences tend to focus more on applications and clustering than machine learning conferences do.

Software

RapidMiner, R, and MATLAB are frequently used.

Keep up with developments in efficient software to solve common problems: for example, by joining the R package mailing list.