
Unfortunately, the use of the sign often highlights what the scrapers want most, so if they pay attention to it at all (rather than just completely ignoring it, as most do now) it will be specifically to go where they are told not to.

The scrapers ideally want content that is original. Content that is also new is often more highly prized, but not as much as you might think⁰. Originality will only become more of a driver as the amount of LLM-generated content out there to be mixed in increases: to limit the Habsburg problem, they won't want too much regurgitated content in the training data.

Bad content from before LLM scraping became a resource problem¹ is highly unlikely to be marked in robots.txt, and the same goes for newly LLM-generated content. People attempting to fend off scrapers and other bots with robots.txt entries are likely protecting the sort of content the scrapers actively want - original output that they've put some time into, or code in a repo they don't want scraped (as scraping a repo is incredibly inefficient and resource heavy from the PoV of the repo owner).

I strongly suspect that the desirable content behind robots.txt “blocks” is far too valuable to ignore, despite the poison content traps, or things otherwise not worth the time to scour through, that might also be there. A “beware of the dog” sign is no protection when the reader actively wants to see the doggies!
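
To make the "sign" point concrete, here is a rough Python sketch of how little work it takes to read a robots.txt as a list of places to go rather than places to avoid. The robots.txt contents, the paths, the `ExampleBot` name and the `disallowed_paths` helper are all made up for illustration; only `urllib.robotparser` is real standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /essays/
Disallow: /git/
Allow: /
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow path, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# A well-behaved crawler uses the file to decide what to skip.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
polite_ok = parser.can_fetch("ExampleBot", "https://example.com/essays/post-1")

# A crawler that ignores the request can read the same lines as a target list.
targets = disallowed_paths(ROBOTS_TXT)

print("polite crawler may fetch the essay:", polite_ok)
print("paths the site owner cares about:", targets)
```

The same few lines that tell a polite bot to stay out also tell everyone else which directories the owner thought were worth protecting.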

--------

[0] If you are scraping to train an LLM you don't want just new content; you would prefer as much of your input data as possible to be as few steps as possible from the original.

[1] And a copying concern, though I'll avoid that discussion as it can get quite thorny, and whichever side of the fence you are on in that matter, the resource consumption is objectively a problem all the same.


