A statistical population exists. Thence, a sample of
Properties
Sample statistics are considered elsewhere in the distribution structure learning pages.
Sample bias
Pick 2 people at random: each is quite likely to be Indian or Chinese, yet over 180 other countries exist, none of which such a sample would represent.
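A minimal sketch of the point above, assuming a made-up population of 10 groups of very unequal size (the group names and sizes are hypothetical, not real demographic data). It only shows that a sample of 2 can cover at most 2 groups, whereas a larger random sample covers many more.

```python
import random

# Hypothetical population: 10 groups with made-up, very unequal sizes.
group_sizes = [4000, 3500, 500, 400, 400, 300, 300, 250, 200, 150]
population = []
for group_id, size in enumerate(group_sizes):
    population.extend([f"group_{group_id}"] * size)

rng = random.Random(0)  # fixed seed so the run is repeatable

tiny_sample = rng.sample(population, 2)      # sample of 2 individuals
large_sample = rng.sample(population, 2000)  # sample of 2000 individuals

# Count how many distinct groups each sample manages to represent.
groups_in_population = len(set(population))
groups_in_tiny = len(set(tiny_sample))
groups_in_large = len(set(large_sample))

print(groups_in_population, groups_in_tiny, groups_in_large)
```

Whatever the seed, `groups_in_tiny` cannot exceed 2, so at least 8 of the 10 hypothetical groups go unrepresented in the tiny sample.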
Completeness, accuracy of examples
It is possible that some points are not completely specified. For a certain point
Independence of data points
In some cases,
In other cases, the input points
Sequential
Firstly, sequential data points can be ordered:
So, data points need not be considered sequential merely because a time/position feature is present. E.g., identifying words with spelling mistakes in a sentence.
Adversarial
Or they may be chosen adversarially: see the game theory ref.
Active choice
Or, as in the case of active learning problems, the learner can take actions to change the input distribution. Reinforcement learning is considered in the AI survey.
Labeling of the data
Some features, a.k.a. the label, may be an (unknown) function of the others. Deducing the dependence of the label on the other features is then the prediction problem.
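As an illustration of the prediction problem, here is a minimal sketch that assumes a single numeric feature and a label that depends on it linearly. The helper name `fit_line` and the data are hypothetical; ordinary least squares is just one possible way to estimate the dependence.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: the label is generated as 2 * x + 1, a rule the learner is not told.
features = [0.0, 1.0, 2.0, 3.0, 4.0]
labels = [2.0 * x + 1.0 for x in features]

slope, intercept = fit_line(features, labels)
predicted = slope * 5.0 + intercept  # predicted label for the unseen point x = 5
print(slope, intercept, predicted)
```

Because the made-up labels are noise-free, the fit recovers the generating rule; with real data the estimate would only approximate the dependence.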
In case of small sample
There is usually insufficient data to guess the shape of the distribution. Only large effects are visible, so a large sample is needed to see small effects; only extreme outliers stand out.
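A rough simulation of why small effects need large samples, assuming data drawn from a standard normal distribution (the distribution, sample sizes and repetition count are illustrative choices). It measures how much the sample mean fluctuates from one sample to the next; an effect smaller than that fluctuation is hard to see.

```python
import random
import statistics

def spread_of_sample_means(sample_size, repetitions, rng):
    """Standard deviation of the sample mean across repeated draws from N(0, 1)."""
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(sample_size))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

rng = random.Random(0)  # fixed seed so the run is repeatable

spread_small = spread_of_sample_means(10, 200, rng)    # samples of 10 points
spread_large = spread_of_sample_means(1000, 200, rng)  # samples of 1000 points

print(spread_small, spread_large)
```

The fluctuation of the mean shrinks roughly as 1/sqrt(n), so the 1000-point samples should show a spread about ten times smaller than the 10-point samples.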
Few high dimensional data-points
Examples
Take the brain activity vs neuronal activity matrix. The activity levels of millions of neurons are the features of each data point. Brain activity is largely determined by a very small number of spiking neurons.
Gene expression vs exterior condition matrix.
Social networks: Individual activity vs group action matrix.
Exploration
Whatever the statistical problem is, exploratory analysis is often the first step in solving the problem.
One often studies the empirical distribution of the data and estimates the central tendency (mean, median, mode), the range, the characteristics of outliers, and the number of missing values. It is further described in the distribution structure learning part.
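A small sketch of such a first pass, using only Python's standard `statistics` module. The helper name `summarize`, the use of `None` to mark missing values, the sample data, and the 1.5 IQR rule for flagging outliers are all assumptions made for illustration.

```python
import statistics

def summarize(values):
    """Basic exploratory summary of a list of numbers; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Rule of thumb: flag points more than 1.5 IQR outside the quartiles.
    outliers = [v for v in present if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return {
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "mode": statistics.mode(present),
        "range": (min(present), max(present)),
        "outliers": outliers,
    }

# Made-up data with two missing entries and one extreme value.
data = [2, 3, 3, 4, None, 5, 3, 4, 40, None, 2]
summary = summarize(data)
print(summary)
```

On this made-up data the mean (about 7.3) sits well above the median (3), which is the kind of discrepancy that prompts a closer look at the extreme value 40.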
One may also cluster the data.
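The notes do not name a clustering method; as one possible illustration, here is a minimal k-means (Lloyd's algorithm) sketch on one-dimensional data. The function name `kmeans_1d`, the starting centers and the data points are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster;
        # a center with no points keeps its previous position.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Made-up data with two visibly separated groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

The result depends on the starting centers and on the number of clusters, both of which are chosen by hand here.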