Difference between the Pandas object and string dtypes 🐼

Do you pd.read_csv()? I bet you do.

When you import data into a Pandas DataFrame, Pandas by default tries to infer the data type of each column.

However, columns containing text are marked as the object dtype by default. 💡

But the object dtype has a much broader scope. It can hold not only strings, but also any other Python object that Pandas can't map to a more specific dtype.

Since Pandas 1.0 (now at 1.1.2), there's a dedicated dtype for working with text data: string (StringDtype). 🤔

How is this important?

When a column has the object dtype, it does not necessarily mean that every value is a string.

In fact, the values can all be numbers, or a mixture of strings, integers and floats.
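
To see this for yourself, here is a minimal sketch of an object column holding three different Python types. The values are made up for illustration, and the dtype is set to object explicitly so the result doesn't depend on how your Pandas version infers it.

```python
import pandas as pd

# Made-up values: one string, one integer, one float
mixed = pd.Series(["apple", 42, 3.14], dtype=object)

print(mixed.dtype)  # object

# Each element keeps its original Python type
value_types = [type(v).__name__ for v in mixed]
print(value_types)  # ['str', 'int', 'float']
```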

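With this discrepancy present, you cannot rely on string operations working on the column straight away: .str methods return NaN for the non-string values, and fail outright if the column holds only numbers.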
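
A small sketch of what that looks like in practice, again with made-up values. On a mixed object column the .str accessor quietly returns a missing value for the non-string entry, and on an object column of plain numbers the accessor itself raises an AttributeError.

```python
import pandas as pd

# Mixed object column: the integer in the middle is not a string
mixed = pd.Series(["apple", 42, "kiwi"], dtype=object)
upper = mixed.str.upper()
print(upper.isna().tolist())  # [False, True, False]

# Object column holding only numbers: .str is not available at all
numbers = pd.Series([1, 2, 3], dtype=object)
try:
    numbers.str.upper()
    accessor_failed = False
except AttributeError:
    accessor_failed = True
print(accessor_failed)  # True
```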

Moreover, the object dtype gives you no clear way to work with just the text while excluding non-text values that happen to be stored as object.

With the new string dtype, the values are explicitly treated as strings.

So now you can extract and manipulate only the strings in a column by explicitly telling Pandas that its dtype is 'string' and not 'object'. 💡
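
Here is one way this might look, as a sketch rather than a recipe. The CSV text and the column names name and qty are invented for the example. It asks for the string dtype at read time via the dtype argument of pd.read_csv(), then converts an existing object column with astype("string").

```python
import io

import pandas as pd

# Invented CSV content, read from memory instead of a file
csv_text = "name,qty\napple,3\nkiwi,5\n"

# Ask for the string dtype up front for the text column
df = pd.read_csv(io.StringIO(csv_text), dtype={"name": "string"})
name_is_string = isinstance(df["name"].dtype, pd.StringDtype)
print(name_is_string)  # True

# Convert an existing object column of text (with a missing value)
fruits = pd.Series(["apple", "kiwi", None], dtype=object).astype("string")
upper = fruits.str.upper()
print(upper.tolist()[:2])  # ['APPLE', 'KIWI']
print(pd.isna(upper[2]))   # True
```

The missing value stays missing after the string operation, so it doesn't get turned into the text "None".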

Have you been using this new dtype?

What other use cases could this be useful for?

#python #datascience #machinelearning