That's a great way of framing it. It's my expectation that we will ruin the internet as a useful training corpus by flooding it with generated articles, and we will end up with a pre-AI date we use to filter incoming data in order to avoid them.
I wouldn't be surprised if filtering regular, pre-LLM bot spam was already a massive hurdle when collating data for ChatGPT.