How does data cleaning play a vital role in analysis?
Data, when collected from various sources as described in (https://medium.com/@TheDataGyan/day-6-getting-data-in-r-9b704ac9c31d), can be really untidy.
- It may not be separated into distinct feature values, nor available in a clean tabular format.
- It can be redundant and full of missing values and outliers (values that lie far outside the expected range of a feature).
- It may not be understandable.
- It may not have a well-defined format.
So before the data is fed to a model, it needs to be processed. This is called data cleaning.
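To make the problems listed above concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption made for illustration: the `raw` records, the `VALID_AGE` range, and the helper names `drop_duplicates` and `clean` are hypothetical, and filling gaps with the median is just one of several reasonable choices.

```python
from statistics import median

# Hypothetical raw records: one redundant row, one missing value, one outlier.
raw = [
    {"name": "Asha", "age": 29},
    {"name": "Ben", "age": 31},
    {"name": "Ben", "age": 31},     # redundant: exact duplicate of the row above
    {"name": "Chen", "age": None},  # missing value
    {"name": "Dina", "age": 30},
    {"name": "Eli", "age": 28},
    {"name": "Fay", "age": 290},    # outlier: far outside the expected range
]

# Assumed plausible range for the "age" feature (illustrative, not a standard).
VALID_AGE = (0, 120)


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def clean(rows, valid_range=VALID_AGE):
    """Remove duplicates and out-of-range ages, then fill missing ages."""
    low, high = valid_range
    rows = drop_duplicates(rows)
    # Drop rows whose age is known but falls outside the expected range.
    rows = [r for r in rows if r["age"] is None or low <= r["age"] <= high]
    # Fill missing ages with the median of the remaining known ages.
    fill = median(r["age"] for r in rows if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in rows]


for row in clean(raw):
    print(row)
```

Dropping the out-of-range row before computing the median keeps the implausible value from distorting the number used to fill the gaps. Whether an outlier should be dropped, corrected, or kept depends on the dataset.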
REF: https://medium.com/sciforce/data-cleaning-and-preprocessing-for-beginners-25748ee00743
Cleaning data from multiple sources and transforming it into a format that data analysts or data scientists can work with is a cumbersome process: as the number of data sources increases, the time taken to clean the data rises steeply, driven by both the number of sources and the volume of data they generate. Cleaning alone might take up to 80% of the total time, which makes it a critical part of the analysis task.