Data cleaning is an essential part of any Data Science project.
The features with textual data usually have a lot of discrepancies which have to be cleared up before any analysis can be done.
Pandas offers a bunch of string operations that come in extremely handy during this step.
Let’s take a quick look at the 3 most used ones:
- Stripping: Sometimes every element has some garbage value at the start or end of the string.
It is essential to remove those, and the strip() function does exactly that.
There are also dedicated rstrip() and lstrip() functions if you need to remove characters from only the right or left end of the string.
- Filtering: Although filtering can be done in many ways, startswith() and endswith() check whether each string starts or ends with specific characters.
They return True where the condition matches and False where it doesn’t, so the result can be used as a mask to filter rows.
- Concatenating: cat() is the function to use if you want to concatenate the elements of two columns with a specified separator.
It also lets you specify, via the na_rep parameter, what to put in place of a missing value.
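
Here’s a rough sketch of the three operations above, all reached through the .str accessor on a column. The DataFrame, the column names (city, code) and their values are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data, invented for this example
df = pd.DataFrame({
    "city": ["  Paris ", "##London##", "Berlin  "],
    "code": ["FR", None, "DE"],
})

# Stripping: remove spaces and '#' characters from both ends of each string
df["city"] = df["city"].str.strip(" #")

# lstrip()/rstrip() work the same way but touch only one end
left_only = pd.Series(["xxabcxx"]).str.lstrip("x")

# Filtering: startswith() returns a boolean mask that can select rows
mask = df["city"].str.startswith("P")
p_cities = df[mask]

# Concatenating: join two columns with a separator,
# using na_rep as the stand-in for missing values
df["label"] = df["city"].str.cat(df["code"], sep="-", na_rep="??")

print(df)
```

Without na_rep, a row with a missing value in either column would come out as missing in the result, so it is worth setting when the data has gaps.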
Have you used these functions?
What other handy operations do you know of?
#python #datascience #machinelearning