The ways in which we can select samples can be divided into two types:
Probability sampling: Some researchers refer to this as random sampling.
Non-probability sampling: This is also referred to as non-random sampling.
Whether you decide to go for a probability or a non-probability approach depends on the following factors:
Goal and scope of the study
Data collection methods that are feasible
Duration of the study
Level of precision you wish to have from the results
Design of the sampling frame and the feasibility of maintaining it
Thanks for bringing this topic up. As a data scientist / AI / ML practitioner, you will need this often. Datasets are sometimes huge, and building a model or doing EDA on the full data right away is not recommended. It is better to first sample a smaller version of the data (perhaps more recent records, or samples drawn from different time frames; this is where the goal and scope of your ML experiment come into play).
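To make the "sample first" idea concrete, here is a minimal sketch using pandas' `DataFrame.sample`; the dataset, column names, and the 1% fraction are all made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical "big" dataset: 1,000,000 rows of synthetic event data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "user_id": rng.integers(0, 10_000, size=1_000_000),
    "value": rng.normal(size=1_000_000),
})

# Simple random sample: draw 1% of the rows for quick EDA / prototyping.
# random_state makes the draw reproducible across runs.
sample = df.sample(frac=0.01, random_state=42)
print(len(sample))  # 10000
```

Fixing `random_state` is worth the habit: a reproducible sample means your EDA findings and early model results can be rechecked later on exactly the same rows.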
Cluster sampling would be my personal choice to try out among the probability-based methods. There are other types as well: simple random sampling, systematic sampling, and stratified sampling.
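A rough sketch of cluster sampling with pandas, under the assumption that a hypothetical `store_id` column defines the clusters: we randomly pick whole clusters, then keep every row belonging to the chosen clusters (unlike simple random sampling, which picks individual rows):

```python
import numpy as np
import pandas as pd

# Synthetic data: 50,000 sales records spread across 100 stores (clusters)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "store_id": rng.integers(0, 100, size=50_000),
    "sales": rng.exponential(scale=20.0, size=50_000),
})

# Cluster sampling: randomly select 10 of the 100 stores without
# replacement, then keep all rows from the selected stores.
clusters = df["store_id"].unique()
chosen = rng.choice(clusters, size=10, replace=False)
cluster_sample = df[df["store_id"].isin(chosen)]

print(cluster_sample["store_id"].nunique())  # 10
```

This keeps intact groups together, which is handy when per-cluster structure (e.g. all of a store's transactions) matters for the analysis; the trade-off is higher variance than a simple random sample of the same size.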
Check this article on clustered sampling using pandas.
Thank you for the valuable guide. I need some help regarding big data; where can I ask you about it?