What is a data science pipeline?

The scope of a data science pipeline depends on the particular business or industry in which data science projects are run. In some cases, the entire set of steps starting from data collection falls under the purview of the pipeline; in others, the pipeline begins with modelling already-cleaned data and extends to generating business insights from the model output.
Still, a data science pipeline typically comprises five key steps: first, logging the incoming stream of data; second, storing that data; third, processing and cleansing the data to make it usable; fourth, modelling the data; and finally, interpreting and communicating the key findings and inferences. A minimal sketch of these stages is given below.
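
To make the five stages concrete, here is a minimal sketch in Python in which each stage is reduced to a placeholder function. The function names, the in-memory records, and the CSV file used as storage are illustrative assumptions only, not part of any particular stack.

```python
# Minimal sketch of the five pipeline stages as plain functions.
# Function names, record layout, and the CSV storage path are illustrative assumptions.
import csv

def log_data(stream):
    """Capture the incoming stream of raw records."""
    return list(stream)

def store_data(records, path="raw_records.csv"):
    """Persist the raw records (a CSV file stands in for real storage)."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(records)
    return path

def process_data(records):
    """Cleanse the data, e.g. drop incomplete rows."""
    return [r for r in records if all(field is not None for field in r)]

def model_data(records):
    """Fit a model; a trivial summary statistic is used as a placeholder."""
    values = [float(r[-1]) for r in records]
    return {"mean_target": sum(values) / len(values)}

def interpret(model_output):
    """Communicate the key findings."""
    print(f"Average target value this period: {model_output['mean_target']:.2f}")

# Stages chained in order: log -> store -> process -> model -> interpret
raw = log_data([("a", 1.0), ("b", 2.0), ("c", None), ("d", 4.0)])
store_data(raw)
clean = process_data(raw)
interpret(model_data(clean))
```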

Data science pipelines are sequences of processing and analysis steps applied to data for a specific purpose. They are useful in production projects, and also when one expects to encounter the same type of business question repeatedly, since a reusable pipeline saves design and coding time. For instance, one could remove outliers, apply dimensionality reduction, and then run the result through a random forest classifier to automatically classify a dataset that is pulled every week.
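
A hedged sketch of that example, assuming scikit-learn and a synthetic dataset standing in for the weekly pull; the variable names, parameter values, and the simple z-score rule for outliers are illustrative choices, not a prescribed implementation.

```python
# Sketch of the pipeline described above: outlier removal, then
# dimensionality reduction, then a random forest classifier.
# The synthetic dataset stands in for whatever data is pulled each week.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Outlier removal happens before the fitted pipeline, because dropping rows
# must also drop the matching labels; a simple z-score rule is used here.
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
mask = (z_scores < 3).all(axis=1)
X_clean, y_clean = X[mask], y[mask]

# Dimensionality reduction and classification as a reusable scikit-learn Pipeline.
clf = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=10)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
clf.fit(X_clean, y_clean)

# On the next weekly pull, the same fitted pipeline classifies new rows directly.
new_batch = X[:5]  # placeholder for newly pulled data
print(clf.predict(new_batch))
```

Once fitted, the same pipeline object can be applied to each week's fresh data, which is precisely the design-time and coding saving mentioned above.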