A step-by-step guide for creating an authentic data science portfolio project

As an aspiring data scientist, building interesting portfolio projects is key to showcasing your skills. Follow along as I explore Germany’s largest travel forum, Vielfliegertreff.

When I was learning to code and do data science as a business student through online courses, I disliked that the datasets were either made up of fake data or had already been solved many times over, like the Boston House Prices or the Titanic dataset on Kaggle. In this blog post, I want to show you how I develop interesting data science project ideas and implement them step by step, using Germany’s biggest frequent flyer forum, Vielfliegertreff, as an example. If you are short on time, feel free to skip to the conclusion (TL;DR).


Step 1: Choose your relevant passion topic

As a first step, I think about a potential project that fulfills the following three requirements, which make it the most interesting and enjoyable to work on:

  1. It solves a problem or burning question of my own
  2. It is connected to a recent event, making it relevant or especially interesting
  3. It has not been solved or covered before

As these ideas are still quite abstract, let me give you a rundown of how my three projects fulfilled the requirements:

Overview of my own data science portfolio projects fulfilling the three outlined requirements.

As a beginner, do not strive for perfection; choose something you are genuinely curious about and write down all the questions you want to explore within that topic.

Step 2: Start scraping together your own dataset

Given that you followed my third requirement, there will be no dataset publicly available, and you will have to scrape the data together yourself. Having scraped a couple of websites, there are three major frameworks I use for different scenarios:

Overview of the 3 major frameworks I use for scraping.

For Vielfliegertreff, I used Scrapy as the framework for the following reasons:

  1. There were no JavaScript-rendered elements hiding the data.
  2. The website structure was complex: I had to go from each forum section to all of its threads, and from each thread to all of its post pages. With Scrapy, you can implement such logic in an organized way by yielding requests that lead to new callback functions (see the sketch after this list).
  3. There were quite a lot of posts, so crawling the entire forum would definitely take some time. Scrapy allows you to scrape websites asynchronously at an incredible speed.
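
To make the second point more concrete, here is a minimal sketch of such a callback chain in Scrapy. The spider name, URLs, and XPath selectors below are placeholders for illustration, not the actual ones I used for Vielfliegertreff:

```python
import scrapy


class ForumSpider(scrapy.Spider):
    # Hypothetical spider: name, URLs, and XPaths are placeholders.
    name = "forum"
    start_urls = ["https://www.example-forum.com/"]

    def parse(self, response):
        # Level 1: follow every forum section to its thread listing.
        for href in response.xpath("//a[@class='forum-link']/@href").getall():
            yield response.follow(href, callback=self.parse_threads)

    def parse_threads(self, response):
        # Level 2: follow every thread to its first page of posts.
        for href in response.xpath("//a[@class='thread-link']/@href").getall():
            yield response.follow(href, callback=self.parse_posts)
        # Paginate through the thread listing itself.
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_threads)

    def parse_posts(self, response):
        # Level 3: extract one item per post, then paginate through the thread.
        for post in response.xpath("//div[@class='post']"):
            permalink = post.xpath(".//a[@class='permalink']/@href").get()
            yield {
                "url": response.urljoin(permalink) if permalink else response.url,
                "author": post.xpath(".//span[@class='author']/text()").get(),
                "date": post.xpath(".//time/@datetime").get(),
                "text": " ".join(post.xpath(".//div[@class='body']//text()").getall()).strip(),
            }
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_posts)
```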

To give you an idea of how powerful Scrapy is, I quickly benchmarked it on my MacBook Pro (13-inch, 2018, four Thunderbolt 3 ports) with a 2.3 GHz quad-core Intel Core i5, which was able to scrape around 3,000 pages per minute:

Scrapy scraping benchmark. (Image by Author)

To be nice and avoid getting blocked, you should scrape gently, for example by enabling Scrapy’s AutoThrottle feature. Furthermore, I saved all the data to a SQLite database via an item pipeline to avoid duplicates, and I logged each URL request so that I would not put extra load on the server if I stopped and restarted the scraping process.
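
As a rough sketch of what such a setup can look like (a minimal example under my own assumptions, not my exact configuration; the project name, database file, table schema, and item fields are placeholders):

```python
# settings.py (excerpt): throttle politely and register the pipeline.
AUTOTHROTTLE_ENABLED = True            # adapt the crawl rate to the server's responses
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
ITEM_PIPELINES = {"myproject.pipelines.SQLitePipeline": 300}
```

```python
# pipelines.py: store posts in SQLite and drop anything already seen.
import sqlite3

from scrapy.exceptions import DropItem


class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("posts.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS posts "
            "(url TEXT PRIMARY KEY, author TEXT, date TEXT, text TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            # The PRIMARY KEY on url rejects posts that are already stored,
            # which also acts as a record of every URL scraped so far.
            self.conn.execute(
                "INSERT INTO posts (url, author, date, text) VALUES (?, ?, ?, ?)",
                (item["url"], item["author"], item["date"], item["text"]),
            )
            self.conn.commit()
        except sqlite3.IntegrityError:
            raise DropItem(f"Duplicate post: {item['url']}")
        return item
```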

Knowing how to scrape gives you the freedom to collect datasets yourself, and it also teaches you important concepts about how the internet works: what a request is and how HTML and XPath are structured.
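
If you want to get a feel for XPath before writing a full spider, you can play with parsel, the selector library underneath Scrapy, on a small HTML snippet. The markup below is a toy example, not the real forum structure:

```python
from parsel import Selector

# Toy HTML snippet; real forum markup is more deeply nested.
html = """
<div class="post" id="post-1">
  <span class="author">traveller42</span>
  <div class="body">Great award availability in First Class this week.</div>
</div>
"""

selector = Selector(text=html)
author = selector.xpath("//div[@class='post']/span[@class='author']/text()").get()
body = selector.xpath("//div[@class='post']/div[@class='body']/text()").get()
print(author, "-", body.strip())
# traveller42 - Great award availability in First Class this week.
```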

For my project, I ended up with 1.47 GB of data, which corresponded to close to 1 million posts in the forum.

Step 3: Cleaning your dataset
