
> Dataset source: wikidump Date: Feb 7, 2019 docs: 5.6M size: 5.3 GB

"wikidump" links to https://dumps.wikimedia.org/enwiki/latest/ , which lists thousands of files, none of which is both around 5 GB and a plausible match for this corpus. That's a very poor corpus link!

It says "Feb 7, 2019", so it probably means https://dumps.wikimedia.org/enwiki/20190120/ or https://dumps.wikimedia.org/enwiki/20190201/ ... maybe. Neither of those has an obvious 5.3 GB file.


