Any plan that starts with "Step one: Apply the tool that almost perfectly distinguishes human traffic from non-human traffic" is doomed to failure. That's whatever the engineering equivalent of "begging the question" is, where the solution to the problem is that we assume that we have the solution to the problem.
Identity verification is not that far-fetched these days. For Europeans you've got eIDAS and related tech, some other places have similar stuff, and for the rest of the world you can do video-based ID checks. There are plenty of providers that handle this; it's pretty commonplace stuff.
That does not generically 100% solve the problem of "is this person a human". It ties things to an identity, but verifying that an identity actually belongs to that human is not solved. Stolen identities, forged identities, faked identities are all still problems, and as soon as the full force of the black market in such things is turned on that world, it'll still be a big problem.
Also video-based ID checks have a shelf-life measured in single-digit years now, if indeed the plural is even appropriate. The tricks for verifying that you're not looking at real-time faked face replacement won't hold up for much longer.
Don't forget what we're talking about, either. We're talking about accessing Wikimedia over HTTP, not briefing some official on top-secret information. How do "video interviews" solve "a highly distributed network of hacked machines is crawling my website using everyone's local identity"?
>That does not generically 100% solve the problem of "is this person a human".
You don't need 100% generic problem solving. A "good enough" solution will block 90% of low-effort bad actors, and that's a huge relief by itself. The next 9% will take some more steps and be combatted, and that last 1% will never truly be held at bay.
Hacker News readers tend to grotesquely underestimate the organization of the underworld, since they aren't in it and aren't generally affected by it. But the underworld is large and organized and very well funded. I'm sure you're operating on a mental model where everyone who sets out to scrape Wikimedia is some random hacker off on their own, sitting down to write a web scraper from scratch having never done it before and not being good at it, and being just gobsmacked by the first anti-scraping tech they find, frantically searching online for how to bypass it and coming up with nothing.
That's not how the world works. Look around you, after all; you can already see the evidence of how untrue this is even in the garbage you find for yourself. You can see it in your spams, which never have problems finding hacked systems to put their forged login pages on.

That's because the people sending the spam aren't hacking the systems themselves... they use a Hacked System As A Service provider. And I am not saying that sarcastically... that's exactly what they are. APIs and all.

Bad actors do not sit down with a fresh college grad and a copy of Python for Dummies to write crawlers in the general case. (Some do, but honestly they're not the worrying ones.) They get Black Web Scraping As A Service, which is a company that can and does pay people full time to figure out how to get around blocks and limits, and when you see people asking questions about how to do that online, you're not seeing the pros. The pros don't ask those questions on Stack Exchange. They just consult their fellow employees, like any other business, because it's a business.
You could probably mentally model the collection of businesses I'm referring to as at least as large as Microsoft or Google, and generally staffed by people as intelligent.
It in fact does need to be a nearly 100% solution, because any crack will be found, exploited, and not merely "shared" but bought and sold freely, in a market that incentivizes people with big payouts to find the exploits.
I really wish people would understand this: the world's security defense teams are collectively, grotesquely understaffed and mismanaged, because people still think they're going up against some stereotypical sweaty guy in a basement who might get bored and wander away from hacking your site if he discovers women, rather than funded professionals attacking and spamming and infiltrating and getting paid large amounts of money to do it.
I've been dealing with this over at golfcourse.wiki for the last couple years. It fucking sucks. The good news is that all the idiot scrapers who don't follow robots.txt seem to fall for the honeypots pretty easily.
Make one honeypot disappear with a big CSS file, make another one disappear with a JS file. Humans aren't aware they're there; bots won't avoid them. Programming a bot to distinguish visible links from invisible links is challenging. And the thing is, these programmers are ubiquitous, and since they're ubiquitous they're not going to be geniuses.
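A minimal sketch of the honeypot idea, in case it's not obvious how little code it takes. All the names here (the trap path, the `.hp` class, the handler functions) are made up for illustration — this isn't golfcourse.wiki's actual implementation. The trap link is hidden from humans with CSS and also listed in robots.txt, so only a crawler that ignores robots.txt and blindly follows every href ever requests it:

```python
# Hypothetical honeypot sketch: a CSS-hidden trap link plus a ban list.

HONEYPOT_PATH = "/internal/members-list"  # also a Disallow line in robots.txt

PAGE_TEMPLATE = """<html><head><style>
  .hp {{ display: none; }}  /* humans never see the trap link */
</style></head><body>
{content}
<a class="hp" href="{trap}">member directory</a>
</body></html>"""

def render_page(content: str) -> str:
    """Embed the invisible trap link in every rendered page."""
    return PAGE_TEMPLATE.format(content=content, trap=HONEYPOT_PATH)

banned_ips: set = set()

def handle_request(client_ip: str, path: str) -> int:
    """Return an HTTP status code; ban any client that takes the bait."""
    if client_ip in banned_ips:
        return 403
    if path == HONEYPOT_PATH:
        # Only a link-following bot ends up here; humans can't see the link.
        banned_ips.add(client_ip)
        return 403
    return 200
```

A real deployment would persist the ban list and vary the trap URL per page, but the core trick really is this small — which is why low-effort scrapers keep falling for it.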