Each row of data represents an observation about something in the world.
When working with data, we often do not have access to all possible observations. This could be for many reasons; for example:
- It may difficult or expensive to make more observations.
- It may be challenging to gather all observations together.
- More observations are expected to be made in the future.
Observations made in a domain represent samples of some broader idealized and unknown population of all possible observations that could be made in the domain. This is a useful conceptualization as we can see the separation and relationship between observations and the idealized population.
We can also see that, even if we intend to use big data infrastructure on all available data, that the data still represents a sample of observations from an idealized population.
Nevertheless, we may wish to estimate properties of the population. We do this by using samples of observations.
Sampling consists of selecting some part of the population to observe so that one may estimate something about the whole population.