Data scientists spend up to 80% of their time cleaning data before they can explore it and build a model. Cleaning data gets much simpler once you know the right library for the specific problem you are facing. In this blog post, you will learn about my 3 favorite libraries for cleaning dirty data, which can save you valuable hours of work.
Parsing and Cleaning Dates with dateparser
In a lot of scraped data you will find dates that do not follow any traditional DateTime format. In forums, for example, posts often carry relative dates such as '3 minutes ago' that are cumbersome to convert into a normal DateTime. Dateparser handles these, as well as dates in various languages, and can even search for dates inside free text.
Anonymize Data with clean-text
For one of the machine learning projects at N26, I used clean-text to preprocess chat transcripts that contained personally identifiable information. Once the text was cleaned, I classified each chat into larger topic clusters, helping the business quickly analyze each month which problems customers face most:
While the library will not perfectly remove every special token with a 100% guarantee, it is a very quick and easy way to get started and refine your text later on.
Join/Compare Arbitrary Strings with fuzzywuzzy
Often you have data from two different sources that need to be joined together to have information available in a single source. When you do not have a key to join on and the text to join on in both sources is slightly different, fuzzywuzzy might save your life:
As the project's README puts it: "Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package."
I hope you enjoyed this quick blog post and find some of these libraries useful in your next project!