
On #1, you mentioned before that Craigslist disallows scraping, yet judging by its robots.txt [1], scraping is mostly fair game unless you are OmniExplorer. The robots.txt standard [2] says nothing about being only for search engines, so there is no special carve-out for search engines specifically.

Additionally, robots.txt is really for automated link traversal, not scrapers in general. If your scraper is initiated by a user, there is no need to follow robots.txt. Not even Google does when the request is user-initiated [3].
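For what it's worth, the kind of robots.txt check an automated crawler would do is already in Python's standard library. A minimal sketch, where the "ExampleBot" user agent, the rules, and the URLs are all hypothetical:

```python
# Sketch: checking a URL against robots.txt rules with Python's
# standard-library parser. The user agent, rules, and URLs here
# are made up for illustration.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse rules inline instead of fetching them, so the example
# is self-contained.
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("ExampleBot", "https://example.com/listings"))   # allowed
print(rp.can_fetch("ExampleBot", "https://example.com/private/x"))  # disallowed
```

The point is that this check is purely voluntary: a user-initiated fetch can simply skip it, which is exactly what Google does per [3].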

From there, the waters only get murkier. Is lynx a scraper because it doesn't render pages the way most web browsers do? Does it get a pass because it still adheres to web standards? What if a real scraper adheres to web standards too? Maybe the storage of scraped data is the real issue? Then what about caches? I could go on, but I'm sure you see what I'm getting at: it's a very complex issue that is not well understood.

[1] http://www.craigslist.org/robots.txt

[2] http://www.robotstxt.org/robotstxt.html

[3] http://support.google.com/webmasters/bin/answer.py?hl=en&...


