How different is data in Kaggle competitions from real data?

In the real world you don’t download datasets; you are the one creating them.

Most models are currently built on data sourced from relational databases.

So, when you are given a problem, the data is often in a database. You author the SQL necessary to extract that data and cleanse it for modeling.
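As a rough illustration of that extract-and-cleanse step, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` table, its columns, and the age range used as a sanity check are all made up for the example; a real schema and real cleansing rules would differ.

```python
import sqlite3

# Hypothetical in-memory table standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "  Alice ", 34), (2, "Bob", None), (3, "carol", 210), (1, "  Alice ", 34)],
)

# Extract and cleanse in one query: drop exact duplicates, trim whitespace,
# and discard rows with missing or implausible ages (assumed range 0 to 120).
query = """
    SELECT DISTINCT id, TRIM(name) AS name, age
    FROM customers
    WHERE age IS NOT NULL AND age BETWEEN 0 AND 120
    ORDER BY id
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

Only the one clean, de-duplicated row survives; the rows with a missing or out-of-range age are filtered out before any modeling starts.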

Additionally, the data may be in several locations. For example, your data could be in a SQL Server DB, an Oracle DB and in some text files. You’ll be the one creating the solution to amalgamate that data.
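To make the amalgamation step concrete, here is a small sketch that joins data from two kinds of sources on a shared key. An in-memory SQLite database stands in for SQL Server or Oracle, and a string buffer stands in for a text file on disk; the `orders` table, the `region` column, and the `customer_id` key are assumptions for the example.

```python
import csv
import io
import sqlite3

# Source 1: a relational database (in-memory SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (1, 5.5), (2, 12.0)]
)
# Aggregate in SQL, then load the (customer_id, total) pairs into a dict.
totals = dict(
    conn.execute("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
)
conn.close()

# Source 2: a text file (a string buffer stands in for a CSV on disk).
text_file = io.StringIO("customer_id,region\n1,EU\n2,US\n3,APAC\n")
regions = {int(r["customer_id"]): r["region"] for r in csv.DictReader(text_file)}

# Combine on the shared key; customers with no orders get a total of 0.0.
combined = [
    {"customer_id": cid, "region": region, "total_spent": totals.get(cid, 0.0)}
    for cid, region in sorted(regions.items())
]

for row in combined:
    print(row)
```

In practice each source would need its own driver and connection details, and the keys rarely line up this neatly, which is a large part of the work.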

Adding to your answer @chirag-garg, I think another difference is that the problem statement is often not defined either. For example, suppose you have e-commerce transaction data that is normalized and stored across several tables, and you want to predict which users are unlikely to be retained. Building a table that has a ‘yes / no’ column matching your definition of retention, along with the relevant features, is a significant task that comes up often in practice and that Kaggle competitions will most likely not require.
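A minimal sketch of what building such a table might look like, again with Python's `sqlite3`. Everything specific here is an assumption for illustration: the `users` and `orders` tables, the cutoff date, and the definition of "retained" as "placed at least one order after the cutoff". Features are computed only from orders on or before the cutoff so that the label does not leak into them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical normalized schema: users and their orders live in separate tables.
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, signup_date TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER,
                         order_date TEXT, amount REAL);
    INSERT INTO users VALUES (1, '2023-01-05'), (2, '2023-02-10'), (3, '2023-03-01');
    INSERT INTO orders VALUES
        (10, 1, '2023-03-15', 40.0),
        (11, 1, '2023-07-02', 25.0),
        (12, 2, '2023-04-20', 60.0),
        (13, 3, '2023-05-11', 15.0),
        (14, 3, '2023-06-01', 30.0);
""")

CUTOFF = "2023-06-30"  # assumed: features use data up to here, label looks after

# One row per user: two features from before the cutoff plus a yes/no label.
# ISO-formatted date strings compare correctly as plain text.
query = """
    SELECT u.user_id,
           COUNT(CASE WHEN o.order_date <= :cutoff THEN 1 END) AS orders_before,
           COALESCE(SUM(CASE WHEN o.order_date <= :cutoff THEN o.amount END), 0)
               AS spend_before,
           CASE WHEN MAX(CASE WHEN o.order_date > :cutoff THEN 1 ELSE 0 END) = 1
                THEN 'yes' ELSE 'no' END AS retained
    FROM users AS u
    LEFT JOIN orders AS o ON o.user_id = u.user_id
    GROUP BY u.user_id
    ORDER BY u.user_id
"""
training_rows = conn.execute(query, {"cutoff": CUTOFF}).fetchall()
conn.close()

for row in training_rows:
    print(row)
```

Changing the cutoff or the retention rule changes the labels, which is exactly the kind of decision a Kaggle dataset has already made for you.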
