
How does this compare to BASE and why isn't BASE used as a source?

"BASE is one of the world's most voluminous search engines especially for academic web resources. BASE provides more than 240 million documents from more than 8,000 content providers. You can access the full texts of about 60% of the indexed documents for free (Open Access). BASE is operated by Bielefeld University Library."

https://www.base-search.net/



Great question!

BASE, SHARE (https://share.osf.io/), and CORE (https://core.ac.uk) all primarily pull metadata via OAI-PMH, though they may also incorporate other sources these days. We have worked with CORE to check content overlap. We have also done our own OAI-PMH bulk scraping and broad preservation crawling, but most of this content has not ended up indexed in fatcat yet.
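For readers unfamiliar with OAI-PMH, here is a rough sketch of what paging through a feed involves. It is an illustration only, not the harvester any of these projects run: the endpoint URL and the function names `list_records_url` and `parse_page` are made up. The `ListRecords` verb, the `oai_dc` metadata prefix, and the `resumptionToken` paging mechanism come from the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper).

    When a resumptionToken is supplied it must be the only argument
    besides the verb, so metadataPrefix is dropped in that case.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (titles, next_token) for one ListRecords response page."""
    root = ET.fromstring(xml_text)
    titles = [el.text for el in root.iter(DC_NS + "title")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    # An empty or missing token means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return titles, token
```

A harvester would fetch `list_records_url(base)`, call `parse_page` on the response, and repeat with the returned token until it is `None`.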

The main reason that we haven't done bulk imports from OAI-PMH directly or from any of these sources is that we haven't gotten a handle on the metadata quality yet. There are many, many duplicate records out there, and we think it is important to merge these correctly (under "work" entities in fatcat). Until recently we didn't have a mechanism to fuzzy-match new records to prevent duplicate creation. Once we get a policy figured out and polish the de-dupe code, we expect to significantly increase the amount of content in the catalog from these sources.
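To make "fuzzy-match" concrete, here is a minimal sketch of the general idea, under stated assumptions. It is not fatcat's actual matching code: the function names, the title-only comparison, and the 0.9 threshold are all hypothetical choices for illustration. It normalizes titles and compares them with the standard library's `difflib.SequenceMatcher`.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def is_probable_duplicate(new_record, existing_record, threshold=0.9):
    """Guess whether two records describe the same work (title-only heuristic).

    The threshold is an assumed value; a real pipeline would likely also
    compare authors, identifiers, and other fields before merging.
    """
    a = normalize_title(new_record["title"])
    b = normalize_title(existing_record["title"])
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

A record that passes the check would be attached to the existing "work" rather than creating a new one; anything below the threshold would be treated as a new work or sent for review.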

Another thing we haven't figured out yet is how to accurately tag OAI-PMH feeds as journals (e.g., OJS instances) vs. institutional repositories or subject repositories. That distinction will change how we classify imported records (e.g., "published" versions vs. "pre-print" or "manuscript").


BASE indexes only the metadata, I believe.



