Much of the catalog functionality can be accessed from the fatcat.wiki API (https://api.fatcat.wiki/redoc). Scholar adds a search index over the body content of papers, and we are still thinking through how to make this available through a public API without slowing down query latency even more.
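For example, here is a minimal sketch of building a lookup-by-DOI request against that API. The `/v0/release/lookup` path and the `doi`/`expand` parameters are from my reading of the redoc documentation, and the DOI is just an arbitrary example; double-check against the docs before relying on this:

```python
from urllib.parse import urlencode

FATCAT_API = "https://api.fatcat.wiki/v0"

def release_lookup_url(doi, expand=None):
    """Build a lookup-by-DOI URL for the fatcat release endpoint."""
    params = {"doi": doi}
    if expand:
        params["expand"] = expand  # e.g. "container,files"
    return f"{FATCAT_API}/release/lookup?{urlencode(params)}"

url = release_lookup_url("10.1371/journal.pone.0133015", expand="container")
# To actually fetch the record (network required):
#   import json, urllib.request
#   release = json.load(urllib.request.urlopen(url))
#   print(release.get("title"))
print(url)
```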
We are friendly with Semantic Scholar, and have used their "open corpus" dumps as one of several URL seed lists for crawling in the past. Their search and discovery tech is more sophisticated than ours is likely to be any time soon (https://medium.com/ai2-blog/building-a-better-search-engine-...). We would love to get to the place where groups like AI2, which are primarily research-oriented, could build on an existing open catalog and corpus, and not need to duplicate effort crawling, merging catalogs, cleaning metadata, etc. As of today Microsoft Academic (used by Semantic Scholar) might be a better option.
Want to be thoughtful about ranking signals, and are deeply skeptical of journal impact factor, h-index, and most bibliometrics. "Has this been cited more than a handful of times" seems like a reasonable coarse boost. Hope to include more curated signals, like "won a paper prize", "journal in DOAJ and other reviewed indices", etc.
Have been working on a citation graph, keep an eye out for something about that in coming months. One cool thing we hope to do with the citation graph is find "missing works" not yet in the catalog (eg, don't have a DOI, especially for pre-1990 era).
Do you think this is suitable for bibliometric research? (We don't need citation graphs.) We use Scopus and Web of Science, but I really don't like that we are not able to publish helpful datasets that we extract from these databases.
I think it is in a good place for simple bibliometric queries. The fatcat elasticsearch API is open at https://api.fatcat.wiki/fatcat_release/ (behind a proxy to filter "unsafe" requests). That works pretty well for jupyter notebook style experimentation if you are willing to learn the elasticsearch query DSL for aggregations and things.
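As a sketch of what that looks like, here is an aggregation query counting journal articles per year. The field names (`release_type`, `release_year`) are my guesses at the fatcat index schema, so check the actual index mapping before using them:

```python
import json

# "size": 0 suppresses individual hits; we only want the aggregate counts.
query = {
    "size": 0,
    "query": {"term": {"release_type": "article-journal"}},
    "aggs": {
        "papers_per_year": {
            "terms": {"field": "release_year", "size": 20}
        }
    },
}

# Send as the body of a _search request from a notebook, e.g.:
#   resp = requests.get("https://api.fatcat.wiki/fatcat_release/_search", json=query)
#   buckets = resp.json()["aggregations"]["papers_per_year"]["buckets"]
print(json.dumps(query, indent=2))
```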
I don't think the catalog has high enough metadata quality today for use in published research. There are some glaring errors and omissions when you actually start digging in. On the other hand, almost all bibliographic catalogs seem to have such problems. Fatcat, by being open and having an API, does have the potential to aggregate corrections, fixes, and contributions directly from researchers over time.
A particular missing piece today is that there is no categorization or "discipline" metadata of almost any type. This sort of metadata is more subjective, and the catalog is currently careful to include only factual information. We will likely start collecting metadata at the journal ("container") level and can trickle that down to papers. Aggregating, editing, and curating that metadata in Wikidata first, then importing to Fatcat, might be the best and most sustainable path forward.
Today I discovered "Open Access Diamond journals" in a report (https://zenodo.org/record/4558704/files/OADJS-Findings.pdf). These are small-scale, peer-reviewed, free-to-read, free-to-publish, non-commercial journals, typically supported by universities or government agencies. They serve diverse communities and are not predatory journals (they are free, after all).
The bad news is that only half of them use DOIs or embed licenses in their metadata. Are they indexed or archived somewhere?
To my surprise, there are more than 350,000 papers published in OA Diamond journals every year, and most journals publish fewer than 25 articles a year.
The Internet Archive is becoming an alternative, better internet. It has a web archive, a film archive, a software archive, a media archive... and now a research paper archive. That is the internet as a giant library, as we dreamed in the early '90s.
Presumably it would be acquired, paywalled, and monetized by a private equity firm (or some suitably hostile intellectual property rightsholder organization) before going bankrupt and shutting down for good.
Exactly. I encourage everyone to become a digital hoarder yourself. See a cool blog post? Assume it will be GONE in 5-10 years. So make a backup PDF copy, and throw it in dropbox. In 5-10 years if you re-encounter that page, and the internet archive is missing the page, you'll be delighted to find it in your own archive, and you can be the one who restores that information to the world.
Internet Archive strikes again! I love Internet Archive, not just for archiving websites but for archiving everything and making it easily accessible. This is another great service that'll help a lot of researchers and hobby-researchers, which is lovely to see.
This is amazing. I had a play around with it whilst it was in beta, and was blown away by the variety of papers returned. On a whim I searched for a very obscure topic I'd previously researched (just for personal interest) using WorldCat / Google Scholar, and to my surprise was presented with several highly relevant papers I'd never come across before, that were exactly what I was looking for.
archive.org is really one of the few things still good on the internet. It has been invaluable for my studies; I can't imagine what previous generations, who could only access 5% of sources, were even doing.
Oh yeah! Tried this on several specific topics I've looked at recently (2 years ago, 7ya, and 150ya) and the results were fast and on the mark. I'll certainly favor using Scholar over IA searches. Congratulations!
How does this compare to BASE and why isn't BASE used as a source?
"BASE is one of the world's most voluminous search engines especially for academic web resources. BASE provides more than 240 million documents from more than 8,000 content providers. You can access the full texts of about 60% of the indexed documents for free (Open Access). BASE is operated by Bielefeld University Library."
BASE, SHARE (https://share.osf.io/), and CORE (https://core.ac.uk) all primarily pull metadata via OAI-PMH, though they may also incorporate other sources these days. We have worked with CORE to check content overlap. We have also done our own OAI-PMH bulk scraping and broad preservation crawling, but most of this content has not ended up indexed in fatcat yet.
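For anyone unfamiliar with OAI-PMH, harvesting boils down to simple HTTP requests. This sketch builds a `ListRecords` request URL; the repository base URL here is hypothetical (every repository publishes its own), and `oai_dc` is the minimal Dublin Core format that all compliant repositories must support:

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # Resumption tokens page through large result sets, and must be
        # sent *instead of* the other request arguments.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return f"{base_url}?{urlencode(params)}"

print(list_records_url("https://repository.example.edu/oai"))
```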
The main reason that we haven't done bulk imports from OAI-PMH directly or from any of these sources is that we haven't gotten a handle on the metadata quality yet. There are many, many duplicate records out there, and we think it is important to merge these correctly (under "work" entities in fatcat). Until recently we didn't have a mechanism to fuzzy-match new records to prevent duplicate creation. Once we get a policy figured out and polish the de-dupe code, we expect to significantly increase the amount of content in the catalog from these sources.
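To illustrate the kind of fuzzy matching involved (this is a toy sketch, not our actual code; the normalization rules and threshold are made up, and real matching also needs to compare authors, years, containers, etc.):

```python
import re
from difflib import SequenceMatcher

def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace runs."""
    title = title.lower()
    title = re.sub(r"[^\w\s]", "", title)
    return re.sub(r"\s+", " ", title).strip()

def titles_match(a, b, threshold=0.93):
    """Rough test of whether two records likely describe the same work."""
    ratio = SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
    return ratio >= threshold

print(titles_match(
    "Building a Better Search Engine for Semantic Scholar.",
    "building a better search engine for semantic scholar",
))  # True
```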
Another thing we haven't figured out yet is accurately tagging OAI-PMH feeds as journals (eg, OJS instances) vs. institutional repositories or subject repositories. That distinction will change how we classify imported records (eg, "published" versions vs. "pre-print" or "manuscript").
(OffTopic) All this talk about the logo here made me check the page out, instead of moving on after reading just the comments as I might otherwise have done. Perhaps that's a HN strategy to use, to get people to actually click through - add a bikesheddy thing to the page that's likely to be divisive, but doesn't require thought. Gives us a cheap way to have an opinion, and thus an incentive to click!
Sci-Hub exists specifically to exfiltrate paywalled research papers; IA Scholar is for open-access papers that have disappeared off the Internet. They do different things.
Interesting. For my field (cardiovascular genetics), the results weren't really what I was expecting. I think that my expectations probably fit pretty well with a PageRank graph of citations. So my guess is that the "relevancy" is semantic only?
I couldn't find a list of what sources (like which journals) they're archiving from. Does anyone know where to find that? It would be nice to see what subject categories the archive covers.
We are mostly not indexing on a journal-by-journal basis, but try to import from large, broad sources. For example, DOI registrars (Crossref, Datacite, J-Stage), DOAJ article and journal metadata (for OA publications), etc. Some field-specific indexes we have imported from include JSTOR early journals subset, PubMed, and dblp.
Some fields/disciplines are probably still systemically under-represented. For example, I bet we are missing a bunch of scholarship on art and history published before 1980. We have a couple ideas up our sleeves which we hope will help with "completeness" across more disciplines.
Click through to see how many articles we know about, and what we think the preservation status is. Click through again to the "coverage" tab for a more detailed breakdown. (Improving the usability and ranking of the journal search results is on our short list.)
I did a vanity search for my own (modest) academic output and found only one paper, which was published by a journal in Europe. The other papers were all published in either Japan or Korea and don’t appear in your search results.
Two large sources in Japan you might consider trying to mirror are the UTokyo Repository [1] and Researchmap [2], through which many researchers in Japan release PDFs of their own papers. Other Japanese universities probably have archives similar to [1].
If you would like me to contact somebody at [1] who might be able to work with you, please let me know in a reply to this comment. (I helped to arrange the IA’s recent tie-up with the University of Tokyo General Library.)
Ah, sorry to hear. We in particular want to include content from outside the US/Europe publishing world.
For Japanese publishing, we have done metadata imports from JaLC (the Japanese DOI registrar) and crawled a lot of open content from J-Stage (https://www.jstage.jst.go.jp/), so I had hoped that coverage was pretty good. If you get a chance, could you try searching for metadata records on https://fatcat.wiki, with both Japanese and English titles and names (if applicable)?
For Korean publishing, the regional DOI registrar (https://www.kisti.re.kr/eng/) does not provide open metadata, which is a known hole in our coverage. IIRC it looked like there might be a way to scrape at least DOIs, titles, and author names, but haven't had time to take a crack at it.
Mainland Chinese publishing is probably the biggest single hole in coverage by absolute numbers. There are two DOI registrars and neither have open metadata.
Regarding u-tokyo.ac.jp, it looks like we are able to consume metadata and do crawls via the OAI-PMH protocol. We crawled over 112k URLs from that domain via that protocol about a year ago, and they should be preserved/mirrored in web.archive.org, but they haven't ended up in fatcat or scholar yet. We want to go slow with pulling in OAI-PMH content, and ensure we de-duplicate records and add filters so we are getting clean metadata and content. Also, preserving repository content hasn't been as urgent as getting to small OA publishers, which might lack a preservation scheme and vanish off the web.
Many thanks for the reply. I will contact some colleagues at our university library to ask for suggestions about how to check systematically how comprehensively J-Stage and Fatcat cover research publications in Japan. I will also ask if they have any suggestions about other sources from which the IA might gather such data from Japan. Either I or they will contact you by e-mail.
My subsequent vanity searches at J-Stage and Fatcat weren’t very encouraging. Most of my own papers have appeared in journals published by Japanese university departments or academic societies. While PDFs of the papers appear on the websites of the issuing organizations and show up on Google Scholar, they don’t seem to have DOIs or be listed on J-Stage.
I should mention that my research has mostly been on the humanities side of things, while J-Stage is “an electronic journal platform for science and technology information in Japan” [1].
Related previous post: https://news.ycombinator.com/item?id=24485444
Folks here might also be interested in this CLI for interfacing with the catalog and making edits: https://gitlab.com/bnewbold/fatcat-cli