I expect we'll see some AI companies in the future throwing away the training dataset. Maybe some have already.
During a court case, the other side can demand discovery over your training dataset, for example to see if it contains a particular copyrighted work.
But if you've already deleted the dataset, the plaintiff can't even prove their work was included, so you're far more likely to win any case against you that hinges on what was in it.
And you can argue that the dataset was very expensive to store (which is true) and was therefore deleted shortly after training was complete. You have no obligation to keep something for the benefit of potential future plaintiffs you aren't even aware of yet.