How does this compare to BASE and why isn't BASE used as a source? "BASE is one ...

bnewbold · on March 11, 2021

Great question!

BASE, SHARE (https://share.osf.io/), and CORE (https://core.ac.uk) all primarily pull metadata via OAI-PMH, though they may also incorporate other sources these days. We have worked with CORE to check content overlap. We have also done our own OAI-PMH bulk scraping and broad preservation crawling, but most of this content has not ended up indexed in fatcat yet.
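
For readers unfamiliar with OAI-PMH harvesting, here is a rough sketch of the paging loop's two building blocks: constructing a `ListRecords` request and pulling the `resumptionToken` out of a response. The function names and the `https://example.org/oai` endpoint are made up for illustration; this is not the code any of these services actually run.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper).

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_resumption_token(response_xml):
    """Return the token for the next page, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would fetch `list_records_url(base)`, store the records, then keep requesting `list_records_url(base, token)` until `next_resumption_token` returns `None`.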

The main reason that we haven't done bulk imports from OAI-PMH directly or from any of these sources is that we haven't gotten a handle on the metadata quality yet. There are many, many duplicate records out there, and we think it is important to merge these correctly (under "work" entities in fatcat). Until recently we didn't have a mechanism to fuzzy-match new records to prevent duplicate creation. Once we get a policy figured out and polish the de-dupe code, we expect to significantly increase the amount of content in the catalog from these sources.
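
To make the fuzzy-matching idea concrete, here is a minimal sketch of title-based matching against existing records before creating a new one. It is an illustration only, not fatcat's actual de-dupe code: the function names, the 0.9 threshold, and the choice of `difflib` similarity are all assumptions, and a real matcher would also compare authors, year, identifiers, and so on.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_accents.lower())
    return cleaned.strip()


def find_fuzzy_match(new_title, existing_titles, threshold=0.9):
    """Return the most similar existing title at or above threshold, else None.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    target = normalize_title(new_title)
    best, best_score = None, 0.0
    for candidate in existing_titles:
        score = SequenceMatcher(None, target, normalize_title(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

If `find_fuzzy_match` returns an existing title, the incoming record would be grouped with that record rather than created as a new, unrelated entry. A linear scan like this would not scale to a full catalog; it only shows the matching step itself.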

Another thing we haven't figured out yet is how to accurately tag OAI-PMH feeds as journals (e.g., OJS instances) vs. institutional or subject repositories. That distinction will change how we classify imported records (e.g., "published" versions vs. "pre-print" or "manuscript").

text7263 · on March 11, 2021

BASE indexes the metadata only, I believe.