Internet Archive Scholar: Search Millions of Research Papers (archive.org)
342 points by bnewbold on March 9, 2021 | hide | past | favorite | 47 comments


This service was hinted at back in September, but is now formally announced and live at https://scholar.archive.org

Related previous post: https://news.ycombinator.com/item?id=24485444

Much of the catalog functionality can be accessed from the fatcat.wiki API (https://api.fatcat.wiki/redoc). Scholar adds a search index over the body content of papers, and we are still thinking through how to make this available through a public API without slowing down query latency even more.
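For anyone curious, here is a minimal sketch of hitting the catalog API from Python. The `release/lookup?doi=...` endpoint shape is my reading of the redoc docs linked above, so treat it as an assumption and check the docs before relying on it:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.fatcat.wiki/v0"

def release_lookup_url(doi: str) -> str:
    """Build a release-lookup URL; the `doi` query parameter is an
    assumption based on the fatcat API docs linked above."""
    return f"{API_BASE}/release/lookup?{urllib.parse.urlencode({'doi': doi})}"

def fetch_release(doi: str) -> dict:
    """Fetch and decode release metadata (requires network access)."""
    with urllib.request.urlopen(release_lookup_url(doi)) as resp:
        return json.load(resp)

# Example: build a lookup URL for a paper by DOI
url = release_lookup_url("10.1038/nature24270")
```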

Folks here might also be interested in this CLI for interfacing with the catalog and making edits: https://gitlab.com/bnewbold/fatcat-cli


I absolutely love everything about it (the logo <3).

Super fast. All my test searches returned what I was looking for.

What is your relationship with semantic scholar like?

Any plans to integrate ranking signals like references, etc?

I'm going to double my monthly donation. This is great.


Thank you for the kind words!

We are friendly with Semantic Scholar, and have used their "open corpus" dumps as one of several URL seed lists for crawling in the past. Their search and discovery tech is more sophisticated than ours is likely to be any time soon (https://medium.com/ai2-blog/building-a-better-search-engine-...). We would love to get to the place where groups like AI2, which are primarily research-oriented, could build on an existing open catalog and corpus, and not need to duplicate time crawling, merging catalogs, cleaning metadata, etc. As of today Microsoft Academic (used by Semantic Scholar) might be a better option.

Want to be thoughtful about ranking signals, and are deeply skeptical of journal impact factor, h-index, and most bibliometrics. "Has this been cited more than a handful of times" seems like a reasonable coarse boost. Hope to include more curated signals, like "won a paper prize", "journal in DOAJ and other reviewed indices", etc.
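The "cited more than a handful of times" idea could be sketched as a binary score multiplier rather than a fine-grained bibliometric ranking. This is purely illustrative; the threshold and boost values are hypothetical, not anything Scholar actually uses:

```python
def coarse_citation_boost(citation_count: int, threshold: int = 5,
                          boost: float = 1.5) -> float:
    """Binary boost: papers cited more than a handful of times get a
    fixed multiplier. Threshold and boost values are hypothetical."""
    return boost if citation_count > threshold else 1.0

def rank_score(base_relevance: float, citation_count: int) -> float:
    """Combine text-match relevance with the coarse citation boost."""
    return base_relevance * coarse_citation_boost(citation_count)
```

The appeal of a step function like this is that it avoids rewarding raw citation counts, which is exactly the bibliometric game the comment above is skeptical of.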

Have been working on a citation graph, keep an eye out for something about that in coming months. One cool thing we hope to do with the citation graph is find "missing works" not yet in the catalog (eg, don't have a DOI, especially for pre-1990 era).


I really like what you have done.

One easy improvement: "Showing results 16 — 30 out of 26 results" :-) appears below the search results...

> Hope to include more curated signals, like "won a paper prize", "journal in DOAJ and other reviewed indices", etc.

This would be a great addition.


Fixed, thanks!


Thank you for this great datasource!

Do you think this is suitable for bibliometric research? (We don't need citation graphs.) We use Scopus and Web of Science, but I really don't like that we are not able to publish helpful datasets that we extract from these databases.


I think it is in a good place for simple bibliometric queries. The fatcat elasticsearch API is open at https://api.fatcat.wiki/fatcat_release/ (behind a proxy to filter "unsafe" requests). That works pretty well for jupyter notebook style experimentation if you are willing to learn the elasticsearch query DSL for aggregations and things.
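As a rough sketch of that kind of notebook experimentation, here is an Elasticsearch aggregation body counting releases per year. The field name `release_year` is an assumption about the index schema; check the mapping at the endpoint above before using it:

```python
import json

def papers_per_year_query(query_string: str, max_years: int = 50) -> dict:
    """Build an Elasticsearch request body with a terms aggregation
    bucketing matches by year. "release_year" is an assumed field name."""
    return {
        "size": 0,  # only aggregation buckets wanted, no individual hits
        "query": {"query_string": {"query": query_string}},
        "aggs": {
            "per_year": {
                "terms": {"field": "release_year", "size": max_years}
            }
        },
    }

# Serialize for POSTing to the index's _search endpoint
body = json.dumps(papers_per_year_query("coral AND bleaching"))
```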

I don't think the catalog has high enough metadata quality today for use in published research. There are some glaring errors and omissions when you actually start digging in. On the other hand, almost all bibliographic catalogs seem to have such problems. Fatcat, by being open and having an API, does have the potential to aggregate corrections, fixes, and contributions directly from researchers over time.

A particular missing piece today is that there is no categorization or "discipline" metadata of almost any type. This sort of metadata is more subjective, and the catalog currently carefully only includes factual information. We will likely start collecting metadata at the journal ("container") level and can trickle that down to papers. Aggregating, editing, and curating that metadata in Wikidata first, then importing to Fatcat, might be the best and most sustainable path forward.


Today I discovered "Open Access Diamond journals" in a report (https://zenodo.org/record/4558704/files/OADJS-Findings.pdf). These are small-scale, peer-reviewed, free-to-read, free-to-publish, non-commercial journals, typically supported by universities or government agencies. They serve diverse communities and are not predatory journals (they are free, after all).

The bad news is only half of them use DOI or embed licenses in the metadata. Are they indexed or archived somewhere?

To my surprise, there are more than 350,000 papers published in OA Diamond journals every year, and most journals publish fewer than 25 articles a year.


The Internet Archive is becoming an alternative good internet. It has a web archive, a film archive, a software archive, a media archive... and now a research papers archive. That is the internet as a giant library, as we dreamed of in the early '90s.


Way too centralized (Centranet?), but it is very nice for now. It's a bit like the library of Alexandria, so it could change/disappear at any time.


> It's a bit like the library of Alexandria, so it could change/disappear at any time.

The irony here is that the only other full copy of the Internet Archive is actually hosted at the Library of Alexandria.

Source: Digital Amnesia Documentary [1]

[1] https://www.youtube.com/watch?v=NdZxI3nFVJs


I'm sure they'd be willing to decentralize it if there was a good way to do that. Maybe this can be done with something like IPFS [0].

[0] https://ipfs.io/


Already exists, guys. You're late to the party.

https://dweb.archive.org/archive.html?identifier=home


The Internet Archive already stores some big public domain data sets in IPFS/Filecoin: Prelinger Films & Librivox audiobooks. They've been partnering with Protocol Labs for 5 years. https://blog.archive.org/2020/10/22/what-information-should-...


Yes, they have very good intentions right now, but what if the leader gets hit by a bus.


Presumably it would be acquired, paywalled, and monetized by a private equity firm (or some suitably hostile intellectual property rightsholder organization) before going bankrupt and shutting down for good.

Thanks for an incredible journey.


The good thing is that the Internet Archive is a nonprofit, so it cannot be acquired.


Wikipedia is a not-for-profit and it has still been acquired, just by people who are insane instead of rich.

Until everyone can own their own copy and moderate it the dream of an open network is just that: a dream.


The amount of data is absolutely insane.


Exactly. I encourage everyone to become a digital hoarder yourself. See a cool blog post? Assume it will be GONE in 5-10 years. So make a backup PDF copy, and throw it in dropbox. In 5-10 years if you re-encounter that page, and the internet archive is missing the page, you'll be delighted to find it in your own archive, and you can be the one who restores that information to the world.
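A minimal sketch of that kind of personal archiving habit (saving a raw HTML copy rather than a PDF, with a dated filename so snapshots sort chronologically; the directory name and filename scheme are my own invention):

```python
import datetime
import pathlib
import urllib.request

def snapshot_filename(url: str, when: datetime.datetime) -> str:
    """Build a readable, date-sortable filename for a saved page."""
    safe = url.split("//", 1)[-1].replace("/", "_")
    return f"{when:%Y-%m-%d}_{safe}.html"

def save_page(url: str, archive_dir: str = "my_archive") -> pathlib.Path:
    """Fetch a page and write the raw HTML into a local archive folder
    (a plain HTML copy, not a rendered PDF; requires network access)."""
    dest = pathlib.Path(archive_dir)
    dest.mkdir(exist_ok=True)
    out = dest / snapshot_filename(url, datetime.datetime.now())
    with urllib.request.urlopen(url) as resp:
        out.write_bytes(resp.read())
    return out
```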


Is it easy to have a local copy?


Internet Archive strikes again! I love Internet Archive, not just for archiving websites but for archiving everything and making it easily accessible. This is another great service that'll help a lot of researchers and hobby-researchers, which is lovely to see.

Don't forget to donate if you also like Internet Archive, they need every penny: https://archive.org/donate/?origin=hn


This is amazing. I had a play around with it whilst it was in beta, and was blown away by the variety of papers returned. On a whim I searched for a very obscure topic that I'd researched before (just for personal interest) in the past using worldcat / google scholar, and to my surprise was presented with several highly relevant papers I'd never come across before, that were exactly what I was looking for.


This seems pretty good.

In computer science we are pretty lucky because open access is the norm.

I checked a few well known exceptions, and this seems to find them ok.

"Mastering the game of Go without human knowledge" (Deepmind in Nature): https://scholar.archive.org/search?q=key:work_yqdj7vjbefg7hh...

"Typing candidate answers using type coercion" (IBM Watson special edition, IEEE IBM Systems Journals): https://scholar.archive.org/search?q=key:work_dym4lqay5fcdxo...


archive.org is really one of the few things still good on the internet. It has been invaluable for my studies; I can't imagine what the previous generations, who could only access 5% of sources, were even doing.


Oh yeah! Tried this on several specific topics I've looked at recently (from 2, 7, and 150 years ago) and the results were fast and on the mark. I'll certainly favor using Scholar over IA searches. Congratulations!


How does this compare to BASE and why isn't BASE used as a source?

"BASE is one of the world's most voluminous search engines especially for academic web resources. BASE provides more than 240 million documents from more than 8,000 content providers. You can access the full texts of about 60% of the indexed documents for free (Open Access). BASE is operated by Bielefeld University Library."

https://www.base-search.net/


Great question!

BASE, SHARE (https://share.osf.io/), and CORE (https://core.ac.uk) all primarily pull metadata via OAI-PMH, though they may also incorporate other sources these days. We have worked with CORE to check content overlap. We have also done our own OAI-PMH bulk scraping and broad preservation crawling, but most of this content has not ended up indexed in fatcat yet.
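For context, OAI-PMH is a simple HTTP protocol: harvesters issue requests like `ListRecords` and page through results with a `resumptionToken`. A hedged sketch of building those request URLs (the verb and parameter names come from the OAI-PMH 2.0 spec; the endpoint URL below is hypothetical):

```python
import urllib.parse

def list_records_url(endpoint: str, metadata_prefix: str = "oai_dc",
                     resumption_token: str = "") -> str:
    """Build an OAI-PMH ListRecords request URL. Per the OAI-PMH 2.0
    spec, a resumptionToken must be the only argument besides the verb
    when continuing a paged harvest."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{endpoint}?{urllib.parse.urlencode(params)}"

# First page of a harvest against a hypothetical repository endpoint:
url = list_records_url("https://example.edu/oai")
```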

The main reason that we haven't done bulk imports from OAI-PMH directly or from any of these sources is that we haven't gotten a handle on the metadata quality yet. There are many, many duplicate records out there, and we think it is important to merge these correctly (under "work" entities in fatcat). Until recently we didn't have a mechanism to fuzzy-match new records to prevent duplicate creation. Once we get a policy figured out and polish the de-dupe code, we expect to significantly increase the amount of content in the catalog from these sources.
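To illustrate the de-dupe problem, a toy sketch of fuzzy title matching: normalize away punctuation and case, then fall back to a similarity ratio. This is not fatcat's actual matching code, just the general shape of the technique, with an arbitrary cutoff:

```python
import difflib
import re

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so
    trivially different records compare equal."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def likely_same_work(title_a: str, title_b: str, cutoff: float = 0.9) -> bool:
    """Fuzzy-match two titles: identical normalized titles always match,
    near-identical ones match above a similarity cutoff."""
    a, b = normalize_title(title_a), normalize_title(title_b)
    if a == b:
        return True
    return difflib.SequenceMatcher(None, a, b).ratio() >= cutoff
```

In practice a real matcher would also weigh authors, years, and DOIs before merging records under a single "work" entity, since titles alone produce false positives.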

Another thing we haven't figured out yet is accurately tagging OAI-PMH feeds as journals (eg, OJS instances) vs. institutional repositories or subject repositories. That distinction will change how we classify imported records (eg, "published" versions vs. "pre-print" or "manuscript").


BASE indexes the metadata only, I believe.


This is nice! I just managed to find an article I couldn't find with Google.

Thus, I was able to solve the PDP-1 "Amherst Mystery" [1]: https://www.masswerk.at/nowgobang/2021/pdp1-spotting#update

[1] https://news.ycombinator.com/item?id=26313124


Google fails to index so many good sources nowadays... I think it has gotten worse over the last 10 years.


See also https://fatcat.wiki (which is, I think, incorporated into Internet Archive Scholar.)


(OffTopic) All this talk about the logo here made me check the page out, instead of moving on after reading just the comments as I might otherwise have done. Perhaps that's a HN strategy to use, to get people to actually click through - add a bikesheddy thing to the page that's likely to be divisive, but doesn't require thought. Gives us a cheap way to have an opinion, and thus an incentive to click!


What are the differences and advantages over Sci-Hub?


Sci-Hub exists specifically to exfiltrate paywalled research papers; IA Scholar is for open-access papers that have disappeared off the Internet. They do different things.


Interesting. For my field (cardiovascular genetics), the results weren't really what I was expecting. I think that my expectations probably fit pretty well with a PageRank graph of citations. So my guess is that the "relevancy" is semantic only?


I'm curious, how does the Internet Archive handle copyright with all of its services?


#endCopyright


I couldn't find a list of what sources (like which journals) they're archiving from. Does anyone know where to find that? It would be nice to see what subject categories the archive covers.


We are mostly not indexing on a journal-by-journal basis, but try to import from large, broad sources. For example, DOI registrars (Crossref, Datacite, J-Stage), DOAJ article and journal metadata (for OA publications), etc. Some field-specific indexes we have imported from include JSTOR early journals subset, PubMed, and dblp.

Some fields/disciplines are probably still systemically under-represented. For example, I bet we are missing a bunch of scholarship on art and history published before 1980. We have a couple ideas up our sleeves which we hope will help with "completeness" across more disciplines.

To answer your question directly, you can search journal names here: https://fatcat.wiki/container/search

And click through to see how many articles we know about, and what we think the preservation status is. Click through again to the "coverage" tab for a more detailed breakdown. (improving the usability and ranking on the journal search results is on our short list)


I did a vanity search for my own (modest) academic output and found only one paper, which was published by a journal in Europe. The other papers were all published in either Japan or Korea and don’t appear in your search results.

Two large sources in Japan you might consider trying to mirror are the UTokyo Repository [1] and Researchmap [2], through which many researchers in Japan release PDFs of their own papers. Other Japanese universities probably have archives similar to [1].

If you would like me to contact somebody at [1] who might be able to work with you, please let me know in a reply to this comment. (I helped to arrange the IA’s recent tie-up with the University of Tokyo General Library.)

[1] https://repository.dl.itc.u-tokyo.ac.jp/?lang=english

[2] https://researchmap.jp/?lang=en


Ah, sorry to hear. We in particular want to include content from outside the US/Europe publishing world.

For Japanese publishing, we have done metadata imports from JaLC (Japanese DOI registrar), and crawled a lot of open content from J-Stage (https://www.jstage.jst.go.jp/) and I hoped that coverage was pretty good. If you get a chance, could you try searching for metadata records on https://fatcat.wiki, with both Japanese and English titles and names (if applicable)?

For Korean publishing, the regional DOI registrar (https://www.kisti.re.kr/eng/) does not provide open metadata, which is a known hole in our coverage. IIRC it looked like there might be a way to scrape at least DOIs, titles, and author names, but haven't had time to take a crack at it.

Mainland Chinese publishing is probably the biggest single hole in coverage by absolute numbers. There are two DOI registrars and neither have open metadata.

Regarding u-tokyo.ac.jp, it looks like we are able to consume metadata and do crawls via the OAI-PMH protocol. We crawled over 112k URLs from that domain via that protocol about a year ago, and they should be preserved/mirrored in web.archive.org, but they haven't ended up in fatcat or scholar yet. We want to go slow with pulling in OAI-PMH content, and ensure we de-duplicate records and add filters to ensure we are getting clean metadata and content. Also, preserving repository content hasn't been as urgent as getting to small OA publishers, which might lack a preservation scheme and vanish off the web.


Many thanks for the reply. I will contact some colleagues at our university library to ask for suggestions about how to check systematically how comprehensively J-Stage and Fatcat cover research publications in Japan. I will also ask if they have any suggestions about other sources from which the IA might gather such data from Japan. Either I or they will contact you by e-mail.

My subsequent vanity searches at J-Stage and Fatcat weren’t very encouraging. Most of my own papers have appeared in journals published by Japanese university departments or academic societies. While PDFs of the papers appear on the websites of the issuing organizations and show up on Google Scholar, they don’t seem to have DOIs or be listed on J-Stage.

I should mention that my research has mostly been on the humanities side of things, while J-Stage is “an electronic journal platform for science and technology information in Japan” [1].

[1] https://www.jstage.jst.go.jp/static/pages/JstageOverview/-ch...


This is great feedback, thank you.

For future follow-up, my work email is my handle here (bnewbold) at archive.org


Here is an appropriate soundtrack for browsing the results:

https://www.youtube.com/watch?v=x8gBfEDoEbY


If you're going to judge it by the logo rather than by the search results, it almost certainly is not for you...


I had the exact opposite reaction. That logo is fabulous.



