How to make your own scraper and then forget about it
So you've found a web page that changes frequently, and you want to follow the changes, but they don't provide a changelog? Then you might want to track the changes yourself.
I've done that on a couple of pages - most notably tracking how bonus point awards change on the Norwegian bonus point system Viatrumf. Feel free to check it out.
This solution consists of three parts:
- a scraper application running every morning
- a frontend displaying the results to you as an end user
- a simple backend-for-frontend
The scraper is the distinguishing part here. It's basically a serverless Python application running on Google Cloud Functions, set to run every morning by a cron job, which GCP implements with Cloud Scheduler publishing to a Pub/Sub topic that triggers the function.
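To give a feel for how little glue code that takes, here's a minimal sketch of what a Pub/Sub-triggered entry point can look like. The function name and the `scrape_and_store` helper are made up for illustration, not taken from the actual project:

```python
# main.py -- entry point for a Pub/Sub-triggered Cloud Function.
# Cloud Scheduler publishes a message to a topic every morning, and the
# function is deployed with that topic as its trigger.
import base64


def run_scraper(event, context):
    """Runs once per Pub/Sub message; the payload itself is not important here."""
    message = base64.b64decode(event.get("data", b"")).decode("utf-8")
    print(f"Triggered by scheduler message: {message!r}")
    scrape_and_store()  # hypothetical helper doing the fetch/parse/save work


def scrape_and_store():
    # Placeholder -- the Scrapy and Firestore parts are sketched further down.
    pass
```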
The function, whose source code you can find on GitHub, uses the Scrapy library to fetch the full HTML from Viatrumf and then parse the interesting elements out of it.
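Here's a hedged sketch of what such a spider can look like. The URL is just the public front page and the CSS selectors are hypothetical; the real ones depend on how the Viatrumf markup is actually structured:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

collected = []  # the parsed rows end up here


class ShopSpider(scrapy.Spider):
    name = "shops"
    start_urls = ["https://www.viatrumf.no/"]  # placeholder URL

    def parse(self, response):
        # Hypothetical selectors: one "card" per shop, with a name and a bonus rate.
        for card in response.css("div.shop-card"):
            collected.append({
                "name": card.css("h3::text").get(default="").strip(),
                "bonus": card.css("span.bonus::text").get(default="").strip(),
            })


process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
process.crawl(ShopSpider)
process.start()  # blocks until the crawl has finished
print(collected)
```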
I did some initial work to figure out how they have structured their HTML, and then used Scrapy to parse it in a straightforward way. From there, I convert the result into Python objects and use the Firebase library to save each entry to Google Cloud Firestore.
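The save step can then be as simple as this sketch. The collection layout and field names are assumptions for illustration, not the actual schema; the point is just that each run appends new documents instead of overwriting old ones:

```python
from datetime import date

import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()  # picks up the function's default credentials
db = firestore.client()


def save_entries(entries: list[dict]) -> None:
    """Store today's parsed entries under a date-stamped document."""
    today = date.today().isoformat()  # e.g. "2021-05-03"
    for entry in entries:
        # "snapshots" and "entries" are placeholder names for the layout.
        db.collection("snapshots").document(today).collection("entries").add(entry)
```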
I've chosen a structure where each entry goes into a new document every day. This way, I keep all the data collected so far, and when building the frontend and backend for accessing and parsing it, I can use that history to show trends, create graphs, show differences et cetera.
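As an example of what that buys you, a backend can compare two daily snapshots with a few lines of Firestore reads. This sketch assumes the placeholder layout and field names from the save example above:

```python
def load_day(db, day: str) -> dict:
    """Read one day's snapshot back as a {shop name: bonus rate} mapping."""
    docs = db.collection("snapshots").document(day).collection("entries").stream()
    return {doc.get("name"): doc.get("bonus") for doc in docs}


def diff_days(db, earlier: str, later: str) -> dict:
    """Return the shops whose bonus rate changed between two days."""
    before, after = load_day(db, earlier), load_day(db, later)
    return {
        shop: (before.get(shop), after.get(shop))
        for shop in set(before) | set(after)
        if before.get(shop) != after.get(shop)
    }
```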
PS: Even though I've done this on GCP, you can perfectly well implement it on the cloud of your choice.