
Scraping things that don't want to be scraped is one of my favorite things to do. At work this is usually an interface for some sort of "network appliance." Though with the push for REST APIs over the last 6 years or so, I don't have a need to do it all that often. Plus with things like selenium it's too easy to just run the page as is, and I can't justify spending the time to figure out the undocumented API.

My favorite one implemented CSRF protections by polling an endpoint, then adding the hashed data from that endpoint and a timestamp to every request.
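A minimal sketch of that pattern, replicating the page's client-side signing in Python. The header names, hash algorithm, and payload format here are my guesses for illustration, not the actual scheme the commenter reverse-engineered:

```python
import hashlib
import time

def build_csrf_token(endpoint_data: str, timestamp: int) -> str:
    """Hash the polled endpoint data together with a timestamp,
    mimicking whatever the page's own JavaScript does."""
    payload = f"{endpoint_data}:{timestamp}".encode()
    return hashlib.sha256(payload).hexdigest()

def signed_headers(endpoint_data: str) -> dict:
    """Headers to attach to every subsequent request."""
    ts = int(time.time())
    return {
        "X-CSRF-Token": build_csrf_token(endpoint_data, ts),
        "X-Timestamp": str(ts),
    }
```

In practice you'd poll the token endpoint first, pass its response body into `signed_headers`, and merge the result into each request.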

When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, it makes me very sad that they're missing out.



To be fair, selenium-style scraping can take a lot of time to set up if you aren't already familiar with the tooling, and the browser rendering APIs are unintuitive and sometimes flat out broken.


Maybe it's because I'm using the python bindings, but it took me about an hour to go from never using it to having it do what I needed it to do. I just messed around in a jupyter notebook until I got what I needed working. Tab complete on live objects is your friend. The hardest part was figuring out where to download a headless browser from.

Though I do prefer requests/bs4. I wrote a helper to generate a requests.Session object from a selenium Browser object. I had something recently where the only thing I needed the javascript engine for was a login form that changed. So by doing it this way I didn't have to rewrite the whole thing. Still kind of bothers me I didn't take the time to figure out how to do it without the headless browser, but it works fine, and I have other things to do.
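The browser-to-requests handoff is mostly a matter of copying cookies across. A sketch of such a helper (the function name is mine; `driver.get_cookies()` returning a list of dicts is standard Selenium behavior):

```python
import requests

def session_from_browser(driver) -> requests.Session:
    """Copy cookies out of a Selenium driver into a requests.Session
    so the rest of the scrape can skip the browser entirely."""
    session = requests.Session()
    for cookie in driver.get_cookies():   # list of dicts, per the WebDriver spec
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain"),
            path=cookie.get("path", "/"),
        )
    return session
```

Use the browser only for the JavaScript-heavy login, then hand the authenticated session to plain requests/bs4 for everything else.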


That's why things like Laravel's Dusk exist: to put a layer over that complex experience.


I was surprised not to see selenium in this article. It is a common tool.


You're absolutely right. It slipped my mind because I considered it more of a language-agnostic tool, and I organized the article around tools for each popular programming language. That said, I've added it to the post as a language-agnostic tool - thanks for the pointer!


> Scraping things that don't want to be scraped

If all else fails, no website can withstand OCR-based screen scraping. It is slow(er), but fast enough for many use cases.


Assuming that you eventually manage to load the page somehow. Which in some edge cases may entail simulating mouse movements and random delays.
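A toy sketch of those two evasion tricks in pure Python: jittered delays instead of machine-perfect intervals, and an interpolated cursor path instead of teleporting the mouse. The numbers are arbitrary, and real bot detection looks at far more signals than this:

```python
import random

def human_delay(base: float = 1.0, jitter: float = 0.5) -> float:
    """A randomized wait time (seconds) drawn around `base`."""
    return max(0.05, random.gauss(base, jitter))

def mouse_path(start, end, steps=25):
    """Interpolate a slightly wobbly path between two points, rather than
    jumping the cursor straight to the target like naive automation does."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(steps + 1):
        t = i / steps
        wobble = random.uniform(-3, 3) if 0 < i < steps else 0
        points.append((x0 + (x1 - x0) * t + wobble,
                       y0 + (y1 - y0) * t + wobble))
    return points
```

The intermediate points could then be replayed through something like Selenium's ActionChains, or a tool like ui.vision mentioned below.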


Agreed. I use the ui.vision extension to simulate native mouse movements.


Have you tried on a page protected by cloudflare captcha?


It's funny, I never seem to hit these infamous Cloudflare captchas. The only impediment I encounter with Cloudflare is that they require plaintext SNI to read their blog, https://blog.cloudflare.com. Unlike with almost all other Cloudflare sites, ESNI will not work.


I have not had to deal with that, but I have idly thought that it might be easier to pipe the audio version into google assistant or something, and see what it comes up with.


It seems to be no problem if you automate a real browser as opposed to a headless browser. I think they test for that.


A browser extension is probably an easier way to extract text than OCR (unless you're targeting a wide range of sites, I suppose).


I remember a workmate having to deal with some difficult-to-scrape data at a previous job - the page randomly rendered with different mark-up (but the same appearance) to mitigate pulling out data using selectors. I think he got to the bottom of it eventually, but it made testing his work a pain.


Playwright's layout selectors might help the next time you encounter this.

https://playwright.dev/docs/selectors#selecting-elements-bas...




