Scraping things that don't want to be scraped is one of my favorite things to do. At work it's usually an interface for some sort of "network appliance." Though with the push for REST APIs over the last six years or so, I don't have a need to do it all too often. Plus, with things like Selenium it's too easy to just run the page as-is, and I can't justify spending the time to figure out the undocumented API.
My favorite one implemented CSRF protections by polling an endpoint, and adding in the hashed data from that endpoint and a timestamp on every request.
When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, it makes me very sad that they're missing out.
To be fair, Selenium-style scraping can take a lot of time to set up if you aren't already familiar with the tooling, and the browser rendering APIs are unintuitive and sometimes flat-out broken.
Maybe it's because I'm using the Python bindings, but it took me about an hour to go from never having used it to having it do what I needed. I just messed around in a Jupyter notebook until I got what I needed working. Tab completion on live objects is your friend. The hardest part was figuring out where to download a headless browser from.
Though I do prefer requests/bs4. I wrote a helper to generate a requests.Session object from a Selenium Browser object. I recently had something where the only thing I needed the JavaScript engine for was a login form that had changed, so doing it this way meant I didn't have to rewrite the whole thing. It still kind of bothers me that I didn't take the time to figure out how to do it without the headless browser, but it works fine, and I have other things to do.
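The core of such a helper is copying the driver's cookies into a requests session. A minimal sketch, assuming a WebDriver-like object with the standard `get_cookies()` method (which returns a list of dicts):

```python
import requests

def session_from_driver(driver) -> requests.Session:
    """Build a requests.Session carrying the cookies from a Selenium driver.

    After logging in via the headless browser, the returned session can
    make authenticated requests without the JavaScript engine.
    """
    session = requests.Session()
    for cookie in driver.get_cookies():
        # Selenium cookies are dicts with at least "name" and "value";
        # domain/path are preserved so scoping matches the browser's.
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain"),
            path=cookie.get("path", "/"),
        )
    return session
```

In practice you may also want to copy the browser's User-Agent header into `session.headers`, since some sites compare it against the cookie's originating session.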
You're absolutely right. It slipped my mind because I considered it more of a language-agnostic tool, and I organized the article around tools for each popular programming language. That said, I've added it to the post as a language-agnostic tool - thanks for the pointer!
It's funny, I never seem to hit these infamous Cloudflare captchas. The only impediment I encounter with Cloudflare is that they require plaintext SNI to read their blog, https://blog.cloudflare.com. Unlike almost all other Cloudflare sites, ESNI will not work there.
I have not had to deal with that, but I have idly thought that it might be easier to pipe the audio version into Google Assistant or something and see what it comes up with.
I remember a workmate having to deal with some difficult-to-scrape data at a previous job - the page randomly rendered with different markup (but the same appearance) to thwart pulling out data using selectors. I think he got to the bottom of it eventually, but it made testing his work a pain.