New technologies come with many advantages. You can chat with friends without leaving your bed, order things online and have them delivered to your door, and even order a pizza and pay for it before it arrives. One of those fancy features is the ability to register for sports activities online. This way you can be sure that a place will be waiting for you, and you also get a convenient way to change your plans and your booking. But there are also days when your favorite workout is fully booked and you keep refreshing the page, waiting for someone to change their plans. This can be frustrating, and it has happened to me many times. That's why I decided to write a simple application that scrapes the booking page and notifies me when a place becomes available for my favorite workout.
At first I wanted to use Python with a simple regex solution (you really shouldn't use regular expressions to parse HTML: it works for collecting simple tags, but it turns out that many web pages are not as regular as regular expressions are). But then I thought: hey, maybe it is possible to deserialize HTML to a POJO, just like you deserialize XML. And guess what? Someone has already done it, and there is a ready-made solution for web scraping. You just need to annotate your POJO fields with the proper CSS selectors and voilà, any web page you read can be transformed into Java objects. It's called jspoon and it uses jsoup to parse the HTML. It works on a similar principle to the Jackson library.
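To make the warning about regular expressions concrete, here is a minimal sketch using only the JDK. The two table rows are made-up markup, not taken from the real booking page, and the class and method names are hypothetical. It shows how a naive pattern that handles a tidy row silently misses cells once attributes, upper-case tags or nested markup appear.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexHtmlSketch {
    // Naive pattern: only matches a bare lower-case <td> with plain text inside.
    private static final Pattern CELL = Pattern.compile("<td>([^<]*)</td>");

    // Collects the text of every cell the naive pattern manages to match.
    static List<String> cells(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = CELL.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical rows: same data, different (but still valid) markup.
        String tidy = "<tr><td>10:15</td><td>Yoga</td><td>3</td></tr>";
        String messy = "<tr><td class=\"hour\">10:15</td><TD>Yoga</TD><td><b>3</b></td></tr>";

        System.out.println("tidy:  " + cells(tidy));
        System.out.println("messy: " + cells(messy));
    }
}
```

The tidy row yields all three cells, while the messy row yields none, even though a browser would render both the same way. A real HTML parser such as jsoup does not care about these differences.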
There is another interesting library I used in my little project. It's called Retrofit and it lets you create a fast and simple HTTP client for any API. It uses OkHttp as the underlying HTTP client, and jspoon has a dedicated converter for it, which makes it a perfect fit for our problem.
So, let's get to work. First we need to determine the structure of the page we want to scrape. We're in luck: it turns out to be a simple HTML table.
|
|
This is actually a single workout from a page listing every workout for a given day. We really only need two values: the hour and the number of available places. But we'll also grab the workout name, just to make the output clearer. Our POJO should look like the following class:
|
|
We have mapped every property we need onto Java POJO fields, but now we need a container for the collection, so that we can read all workouts from the page. We do that with the following code:
|
|
The catch here is that we don't point at the workout container, which in our case is table > tbody. Instead we must give the CSS selector for a single workout entry, which in our case is table > tbody > tr, because every table row is mapped to a single workout entry.
With the above classes in place, we could simply invoke jspoon directly and deserialize our workout entries:
|
|
But let's get a little creative and use the Retrofit library to fetch the page. To do so, let's create our API service interface with a properly annotated method:
|
|
We declared an HTTP GET operation and the URL path for fetching our workout entries. There is also a dynamic parameter, day, which becomes part of the URL query and is set from the method parameter. With this interface in place, it is time for the controller code:
|
|
As you can see, we use the builder pattern to configure Retrofit with the proper URL and data converter. Later we use it to create the API service and make the call. It takes just a few lines of Java code. Now let's use streams to enhance the Workout class functionality and return the workout for a specified hour:
|
|
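As a rough illustration of the stream-based lookup, here is a minimal sketch using only the JDK. The `Workout` class below is a simplified stand-in with assumed field names (`hour`, `name`, `availability`), and `findByHour` is a hypothetical helper, so the version in the repository may look different.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class WorkoutLookupSketch {
    // Simplified stand-in for the scraped workout entry (field names are assumptions).
    static final class Workout {
        final String hour;
        final String name;
        final int availability;

        Workout(String hour, String name, int availability) {
            this.hour = hour;
            this.name = name;
            this.availability = availability;
        }

        @Override
        public String toString() {
            return hour + " " + name + " (free places: " + availability + ")";
        }
    }

    // Returns the first workout that starts at the given hour, if there is one.
    static Optional<Workout> findByHour(List<Workout> workouts, String hour) {
        return workouts.stream()
                .filter(workout -> workout.hour.equals(hour))
                .findFirst();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one scraped day.
        List<Workout> day = Arrays.asList(
                new Workout("09:00", "Crossfit", 0),
                new Workout("10:15", "Yoga", 3));

        System.out.println(findByHour(day, "10:15").map(Workout::toString).orElse("no workout at that hour"));
    }
}
```

Returning an `Optional` keeps the "no workout at that hour" case explicit instead of handing back `null`.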
We can also create an entry point class that reads the user's parameters and invokes the controller code:
|
|
The above code, together with the previous classes, returns the workout for a date and time given in ISO format, something like 2011-12-03T10:15, and prints output similar to:
|
|
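For reference, an ISO date-time argument like the one above can be split into a day and an hour with `java.time` alone. This is only a sketch: the helper names are hypothetical, and it assumes the booking page expects the day as `yyyy-MM-dd` and shows hours as `HH:mm`, which the real site may not do.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class IsoArgumentSketch {
    // Assumed display format of the workout hour on the page.
    private static final DateTimeFormatter HOUR_FORMAT = DateTimeFormatter.ofPattern("HH:mm");

    // Extracts the date part, e.g. to use as the day query parameter.
    static String dayOf(String isoDateTime) {
        return LocalDateTime.parse(isoDateTime).toLocalDate().toString();
    }

    // Extracts the time part, e.g. to match against a workout's hour.
    static String hourOf(String isoDateTime) {
        return LocalDateTime.parse(isoDateTime).toLocalTime().format(HOUR_FORMAT);
    }

    public static void main(String[] args) {
        String argument = args.length > 0 ? args[0] : "2011-12-03T10:15";
        System.out.println("day:  " + dayOf(argument));
        System.out.println("hour: " + hourOf(argument));
    }
}
```

`LocalDateTime.parse` throws a `DateTimeParseException` for malformed input, so an entry point would probably want to catch that and print a usage hint.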
Of course you can change the code to output only the availability and, if it is greater than 0, notify you in any way you choose. But that is outside the scope of this post and will be described next time, when I show how to connect the above mechanism to a Facebook Messenger bot to set a hook and get notified when it is possible to register for the workout again.
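Until then, the "notify when availability is greater than 0" idea could be sketched roughly like this. Everything here is hypothetical: the availability source stands in for the scraping call, and the notifier is just a callback that could print to the console or, later, talk to a bot.

```java
import java.util.function.Consumer;
import java.util.function.IntSupplier;

class AvailabilityWatcherSketch {
    // Performs a single check; returns true if a notification was sent.
    static boolean checkOnce(IntSupplier availability, Consumer<String> notifier) {
        int freePlaces = availability.getAsInt();
        if (freePlaces > 0) {
            notifier.accept("Places available: " + freePlaces);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for the scraper: pretend two places have just been freed.
        IntSupplier fakeScraper = () -> 2;
        checkOnce(fakeScraper, System.out::println);
    }
}
```

In a real setup this check could be scheduled periodically, for example with a `ScheduledExecutorService`, instead of refreshing the page by hand.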
The whole solution can be downloaded or cloned from the GitHub repo: https://github.com/100c1p43r/CfNotify