Common Crawl isn't open source. Its license has many specific restrictions. Some are often interpreted in specific ways that could lead to censorship due to politics. I'm not sure what the CC organization's interpretations are, though.
I did think they'd be the best at making the URL part of what I described. Their CC dumps are copyright infringement (file sharing), but links probably would be legal. The link data would need to be released under an open-source license without their extra terms.
Alternatively, they could give a no-terms license to specific, paying parties for internal use. Or they could release it for non-commercial use, with low, recurring prices for (a) commercial use and (b) regular updates on link metadata. Paid alternatives might be good for their funding.
I can't use them right now, though, because I can't guarantee that all of my or my customers' uses meet all likely interpretations of their terms. I'm not even sure how to make a best effort at that. I'd rather they just publish their metadata under the Apache license or something.