When I learned coding and data science as a business student through online courses, I disliked that datasets were either made up of fake data or had been solved many times before, like the Boston House Prices or Titanic datasets on Kaggle. In this blog post, I want to show you how I develop interesting data science project ideas and implement them step by step, using the example of exploring Germany’s biggest frequent flyer forum, Vielfliegertreff. If you are short on time, feel free to skip to the conclusion (TL;DR).
Step 1: Choose your relevant passion topic
As a first step, I think about a potential project that fulfills the following three requirements to make it the most interesting and enjoyable:
- Solving my problem or burning question
- Connected to some recent event to be relevant or especially interesting
- Has not been solved or covered before
As these ideas are still quite abstract, let me give you a rundown of how my three projects fulfilled the requirements:
As a beginner, do not strive for perfection; choose something you are genuinely curious about and write down all the questions you want to explore in your topic.
Step 2: Start scraping together your own dataset
Given that you followed my third requirement, there will be no dataset publicly available, so you will have to scrape the data together yourself. Having scraped a couple of websites myself, I rely on three major frameworks for different scenarios:
For Vielfliegertreff, I used scrapy as a framework for the following reasons:
- The website structure is complex: you have to go from each forum subject to all of its threads, and from each thread to all of its post pages. With scrapy you can easily implement such logic by yielding requests that hand their responses to new callback functions in an organized way.
- There were quite a lot of posts, so crawling the entire forum would definitely take some time. Scrapy allows you to scrape websites asynchronously at an incredible speed.
To give you just an idea of how powerful scrapy is, I quickly benchmarked it on my MacBook Pro (13-inch, 2018, four Thunderbolt 3 ports) with a 2.3 GHz quad-core Intel Core i5, which was able to scrape around 3,000 pages per minute.
To be nice and not get blocked, you should scrape gently, for example by enabling scrapy’s AutoThrottle feature. Furthermore, I saved all data to a SQLite database via an item pipeline to avoid duplicates, and enabled logging of each URL request so that I do not put extra load on the server when I stop and restart the scraping process.
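A deduplicating SQLite pipeline can be sketched roughly like the class below; the table layout and the `post_id` field are my assumptions, and in scrapy you would register the class under `ITEM_PIPELINES` in settings.py (alongside `AUTOTHROTTLE_ENABLED = True` for gentle scraping):

```python
import sqlite3


class SQLitePipeline:
    """Sketch of an item pipeline that skips already-seen posts."""

    def __init__(self, db_path="posts.db"):
        self.conn = sqlite3.connect(db_path)
        # post_id as PRIMARY KEY makes duplicate inserts fail cleanly.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS posts "
            "(post_id TEXT PRIMARY KEY, author TEXT, text TEXT)"
        )

    def process_item(self, item, spider=None):
        try:
            self.conn.execute(
                "INSERT INTO posts VALUES (?, ?, ?)",
                (item["post_id"], item["author"], item["text"]),
            )
            self.conn.commit()
        except sqlite3.IntegrityError:
            # Duplicate post_id: this post is already stored, so skip it.
            pass
        return item
```

Letting the database enforce uniqueness means a restarted crawl can safely revisit pages without writing the same post twice.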
Knowing how to scrape gives you the freedom to collect datasets yourself, and it also teaches you important concepts about how the internet works: what a request is, and how HTML and XPath are structured.
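As a toy illustration of XPath-style extraction, here is the limited XPath subset in Python’s standard library applied to a made-up snippet standing in for a forum post page:

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a forum post page.
page = """
<html>
  <body>
    <div class="post"><span class="author">alice</span><p>First post</p></div>
    <div class="post"><span class="author">bob</span><p>Second post</p></div>
  </body>
</html>
"""

root = ET.fromstring(page)
# XPath-like query: the <p> child of every post <div> in the tree.
texts = [p.text for p in root.findall(".//div[@class='post']/p")]
print(texts)  # ['First post', 'Second post']
```

Real-world HTML is rarely well-formed XML, which is why scrapy ships its own selectors, but the path syntax carries over directly.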
For my project, I ended up with 1.47 GB of data, which amounted to close to 1 million posts from the forum.
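A quick sanity check on those numbers: 1.47 GB spread over roughly 1 million posts works out to about 1.5 KB of raw data per post, which is plausible for short forum messages plus metadata:

```python
total_bytes = 1.47 * 10**9  # 1.47 GB of scraped data (decimal GB)
posts = 1_000_000           # roughly 1 million posts

kb_per_post = total_bytes / posts / 1000
print(f"{kb_per_post:.2f} KB per post")  # 1.47 KB per post
```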