Given that LLMs are trained on its data, it's a travesty to prevent the public from accessing it all the same.
I'm thinking it should be distributed using physical media given its size. A 20-volume encyclopedia on hard drives? We used to do this back in the day when the internet was too slow. I've had friends give me anything from pirated encyclopedias to MSDN docs on CDs. If enough people have enough of the volumes, they could seed and keep it going. But if only a handful of people have the actual data, it's a matter of time before it's taken offline for good.
If you want to have the whole thing, it'd probably take dozens of hard drives, each of which needs several hours to copy. For scale, individual chunks of the archive are several gigabytes, and there are thousands of chunks. It's not like that torrent with 1000 books you got back in the day. There are millions of books and god knows what else in the archive.
I know. I've seen drives as high as 24 TB. You'd still probably need at least a dozen of the highest-capacity ones on the market. I believe the guys running the archive even have tape drives, but the right kind of tape drive costs a fortune.
They could let buyers pay for the cost. Classify the drives by topic category. The objective is that, if enough of the drives exist out there, stopping its distribution or access becomes a futile effort.
I don't think you understand how big of an ask that is in the US or any Western country. What you're proposing is in-person bootlegging with extra steps, but with very expensive equipment. There may be too many internet pirates to bother with, but the government will raid you for something like this. The more copies you make, the bigger the response is. If you're going to break the law, it's best that you not be flagrant about it. The US government has scooped up people in other countries that were thought to be neutral for taking a big role in piracy.
I could be wrong, but the laws apply just the same as for a download. Perhaps the scrutiny might be higher, though. But over PDFs? I don't know. The shipping details could be tricky too, but dicier things are shipped from certain marketplaces.
The fact that it's PDF files is of little consequence. It's bootlegging. If it became popular, you would get raided for distributing copies. It doesn't happen for small-scale transfers between friends, but if you start doing it on a large enough scale then you're asking for trouble. People have had similar ideas for as long as recorded media has existed. Pirate copies of CDs, books, movies, etc. are sold freely in countries where IP laws are lax, and generally not sold in countries where these laws are enforced. The risk of getting caught could be lowered quite a bit by taking precautions, but if you're actively trying to be a large-scale pirate then you're bound to run into trouble eventually. Enforcement usually focuses on distributors rather than consumers. But with AI and increasingly invasive tech, that could change.