
Scraping things that don't want to be scraped is one of my favorite things to do. At work this is usually an interface for some sort of "network appliance." Though with the push for REST APIs over the last 6 years or so, I don't have a need to do it all that often. Plus with things like selenium it's too easy to just run the page as is, and I can't justify spending the time to figure out the undocumented API.

My favorite one implemented CSRF protections by polling an endpoint, then adding the hashed data from that endpoint and a timestamp to every request.
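A minimal sketch of that pattern, replicating the page's client-side signing in Python. The header names, hash algorithm, and payload format here are my guesses for illustration, not the actual scheme the commenter reverse-engineered:

```python
import hashlib
import time

def build_csrf_token(endpoint_data: str, timestamp: int) -> str:
    """Hash the polled endpoint data together with a timestamp,
    mimicking whatever the page's own JavaScript does."""
    payload = f"{endpoint_data}:{timestamp}".encode()
    return hashlib.sha256(payload).hexdigest()

def signed_headers(endpoint_data: str) -> dict:
    """Headers to attach to every subsequent request."""
    ts = int(time.time())
    return {
        "X-CSRF-Token": build_csrf_token(endpoint_data, ts),
        "X-Timestamp": str(ts),
    }
```

In practice you'd poll the token endpoint first, pass its response body into `signed_headers`, and merge the result into each request.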

When I hear a junior dev give up on something because the API doesn't provide the functionality of the UI, it makes me very sad that they're missing out.



To be fair, selenium-style scraping can take a lot of time to set up if you aren't already familiar with the tooling, and the browser rendering APIs are unintuitive and sometimes flat out broken.


Maybe it's because I'm using the python bindings, but it took me about an hour to go from never using it to having it do what I needed it to do. I just messed around in a jupyter notebook until I got what I needed working. Tab complete on live objects is your friend. The hardest part was figuring out where to download a headless browser from.

Though I do prefer requests/bs4. I wrote a helper to generate a requests.Session object from a selenium Browser object. I had something recently where the only thing I needed the javascript engine for was a login form that changed. So by doing it this way I didn't have to rewrite the whole thing. Still kind of bothers me I didn't take the time to figure out how to do it without the headless browser, but it works fine, and I have other things to do.
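The browser-to-requests handoff is mostly a matter of copying cookies across. A sketch of such a helper (the function name is mine; `driver.get_cookies()` returning a list of dicts is standard Selenium behavior):

```python
import requests

def session_from_browser(driver) -> requests.Session:
    """Copy cookies out of a Selenium driver into a requests.Session
    so the rest of the scrape can skip the browser entirely."""
    session = requests.Session()
    for cookie in driver.get_cookies():   # list of dicts, per the WebDriver spec
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain"),
            path=cookie.get("path", "/"),
        )
    return session
```

Use the browser only for the JavaScript-heavy login, then hand the authenticated session to plain requests/bs4 for everything else.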


That's why things like Laravel's Dusk exist: to put a layer over that complex experience.


I was surprised not to see selenium in this article. It is a common tool.


You're absolutely right. It slipped my mind because I considered it more of a language-agnostic tool, and I organized the article around tools for each popular programming language. That said, I've added it to the post as a language-agnostic tool - thanks for the pointer!


> Scraping things that don't want to be scraped

If all else fails, no website can withstand OCR-based screen scraping. It is slow(er), but fast enough for many use cases.


Assuming that you eventually manage to load the page somehow. Which in some edge cases may entail simulating mouse movements and random delays.
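A toy sketch of those two evasion tricks in pure Python: jittered delays instead of machine-perfect intervals, and an interpolated cursor path instead of teleporting the mouse. The numbers are arbitrary, and real bot detection looks at far more signals than this:

```python
import random

def human_delay(base: float = 1.0, jitter: float = 0.5) -> float:
    """A randomized wait time (seconds) drawn around `base`."""
    return max(0.05, random.gauss(base, jitter))

def mouse_path(start, end, steps=25):
    """Interpolate a slightly wobbly path between two points, rather than
    jumping the cursor straight to the target like naive automation does."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(steps + 1):
        t = i / steps
        wobble = random.uniform(-3, 3) if 0 < i < steps else 0
        points.append((x0 + (x1 - x0) * t + wobble,
                       y0 + (y1 - y0) * t + wobble))
    return points
```

The intermediate points could then be replayed through something like Selenium's ActionChains, or a tool like ui.vision mentioned below.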


Agreed. I use the ui.vision extension to simulate native mouse movements.


Have you tried on a page protected by cloudflare captcha?


It's funny, I never seem to hit these infamous Cloudflare captchas. The only impediment I encounter with Cloudflare is that they require plaintext SNI to read their blog, https://blog.cloudflare.com. Unlike with almost all other Cloudflare sites, ESNI will not work.


I have not had to deal with that, but I have idly thought that it might be easier to pipe the audio version into google assistant or something, and see what it comes up with.


It seems to be no problem if you automate a real browser as opposed to a headless browser. I think they test for that.


A browser extension is probably an easier way to extract text than OCR (unless you're targeting a wide range of sites, I suppose).


I remember a workmate having to deal with some difficult-to-scrape data at a previous job - the page randomly rendered with different mark-up (but the same appearance) to mitigate pulling out data using selectors. I think he got to the bottom of it eventually, but it made testing his work a pain.


Playwright's layout selectors might help the next time you encounter this.

https://playwright.dev/docs/selectors#selecting-elements-bas...




