Scraping WordPress - 4300 court rulings in exchange rate lawsuits without a line of code
It is not often that delivering a service takes less time than pricing it, but with scraping, this can happen. See how easy it can be to retrieve data, especially from WordPress.
Daniel Gustaw
• 2 min read
It is not often that delivering a service takes less time than estimating it, but with scraping, this can happen. Scraping is similar to hacking: depending on the security measures and the complexity of the system we are extracting data from, it can be either trivially simple or a serious challenge.
In this post, I will show how I performed the scraping service before I had time to estimate it. I did not write a single line of code, and the whole process took me a few minutes.
What the client needed:
The inquiry was about the database of court judgments from the site
https://nawigator.bankowebezprawie.pl/pozwy-indywidualne/
Thanks to the Wappalyzer plugin, we can read that it is WordPress - an old technology that is usually friendly to scraping, since choosing it often suggests there was no budget for any anti-scraping measures.
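What Wappalyzer reports can also be roughly approximated by hand. The sketch below is only a heuristic, assuming the page exposes the usual WordPress fingerprints (a generator meta tag, or asset paths under /wp-content/ or /wp-includes/). The function name looks_like_wordpress is made up for illustration, and this is not how Wappalyzer itself works.

```python
import re

def looks_like_wordpress(html: str) -> bool:
    """Heuristic check for common WordPress fingerprints in raw HTML."""
    # Many WordPress sites emit <meta name="generator" content="WordPress x.y">
    generator = re.search(
        r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']WordPress',
        html,
        flags=re.IGNORECASE,
    )
    # Themes, plugins and core assets are served from these directories
    asset_paths = re.search(r'/wp-(content|includes)/', html)
    return bool(generator or asset_paths)

# Made-up fragment of a page, used only to exercise the function
sample = '<link rel="stylesheet" href="/wp-content/themes/x/style.css">'
print(looks_like_wordpress(sample))
```

A site can hide both fingerprints, so a negative result proves nothing; a positive one is a reasonable hint.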
The table reloads in real time. Pagination does not change the URLs. This is typical behavior for the DataTables package, which is a jQuery plugin.
On the website of this plugin, we will find the same table, just with slightly modified styles.
These are sufficient clues to suggest that the data for the table is loaded from a single place. A quick analysis of network traffic does not show anything interesting, but viewing the page source does: the whole data set for the table is embedded directly in the HTML.
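I did this step by hand, but if the same extraction had to be repeated, it could be automated. Here is a minimal sketch, assuming the data sits in the source as a strict JSON array assigned to some variable. The marker name tableData and the sample fields are hypothetical, and a JavaScript literal that is not valid JSON (single quotes, unquoted keys) would need a different parser.

```python
import json

def extract_embedded_array(html: str, marker: str) -> list:
    """Return the JSON array that follows `marker` in the page source."""
    start = html.index(marker)       # raises ValueError if the marker is absent
    start = html.index("[", start)   # first opening bracket after the marker
    # raw_decode parses one JSON value and ignores whatever follows it
    rows, _end = json.JSONDecoder().raw_decode(html, start)
    return rows

# Hypothetical page fragment; the real variable name and fields will differ
page = '<script>var tableData = [{"id": 1, "court": "example"}];</script>'
rows = extract_embedded_array(page, "tableData")
print(len(rows), rows[0]["court"])
```

Using raw_decode avoids having to find the end of the array with a regular expression, which tends to break on nested brackets.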
The rest of the service was just a matter of selecting those few thousand lines of text and saving them in a json file. Optionally, for the convenience of the end user, the file can be converted to csv or xlsx, for example with an online converter.
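The conversion can also be done locally. The sketch below uses only the Python standard library and assumes the json file holds an array of flat objects; that shape is an assumption, so an array of arrays or nested objects would need adjusting. The function name json_rows_to_csv and the sample rows are made up for illustration.

```python
import csv
import io
import json

def json_rows_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    rows = json.loads(json_text)
    # Collect column names in first-seen order, since rows may differ in keys
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    # restval fills in an empty cell when a row lacks one of the columns
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up rows, only to show the shape of the input
sample_json = '[{"id": 1, "court": "A"}, {"id": 2, "year": 2020}]'
csv_text = json_rows_to_csv(sample_json)
print(csv_text)
```

For a real file, the text would be read from disk first and the result written to a file with the csv extension, which spreadsheet programs can open and save as xlsx.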
Links to downloaded data:
https://preciselab.fra1.digitaloceanspaces.com/blog/scraping/pc.json
https://preciselab.fra1.digitaloceanspaces.com/blog/scraping/pc.json.xlsx
At the end, I would like to emphasize that although access to this data is free, the people working on its structuring are doing so on a voluntary basis to achieve the goal set by the association:
B) collecting information about unfair practices of entrepreneurs and other cases of legal violations by these entities, and developing and publicly sharing information, articles, reports, and opinions in this regard.
https://rejestr.io/krs/573742/stowarzyszenie-stop-bankowemu-bezprawiu
If you want to benefit from their work, I encourage you to support them on their website.