
The next step would be digitizing the books and manuscripts so scholars can collectively research the finding.

https://www.medievalists.net/?s=digitizing&submit=Search

I wonder what the cost of this digitization process would be and what research labs can render this service.



I'm wondering the other side of it: given how fragile digital storage and peripherals are, are there efforts to transcribe books like this onto archival paper with archival inks? Seems it really would be kinda fun to have a modern day monastery copying books by hand like in the ancient times...


I'm not sure how common they are or precisely where it came from, but I know someone with a hand-copied prayer book from the Indian Malankara Church. I'm not sure when the original was made, but this one was copied in the 1960s, so it is nearly a minor relic in its own right.


> given how fragile digital storage and peripherals are

Post it on the web. Lots of people will inevitably make copies, ensuring its survival.


In my experience, even for things of some value, this isn't true. I'd perhaps go as far as saying this (understandable) point of view is dangerously misleading.

The two greatest losses I've observed have been the dismemberment of Usenet, and the deprecation of the FTP protocol.

Google's purchase of DejaNews was viewed at the time as a likely win: Google had the budget and the right ethos to preserve the history intact. And perhaps that discouraged others from doing their own preservation? As time has progressed, I've found that Google's archive is missing a lot, and there are precious few other sources. Lots of useful stuff is lost.

More recently, with the major browsers deprecating the FTP protocol, many of the large software archives have closed down due to reduced usage. Many, many things that used to be in the main archives are very difficult or impossible to find now, and lots are likely lost.

As examples, I was recently attempting to find the source code for various Mach (operating system) variants. It was quite difficult to find anything, and many things I know existed from personal experience seem impossible to locate now.

Similarly, some of the early releases of Unix appear to be lost. Various old tapes have been found, but despite being a high-profile item, we've yet to discover copies of key historical releases (see TUHS.org for what does exist).

We need to begin to actively curate and manage our history, or it will disappear.


The key element for both distributed storage and open source codebases appears to be refresh/review frequency.

Which is similar to, but slightly different from, popularity.

Essentially "With what frequency will someone put eyeballs on and attempt to use this thing"?

Below some critical threshold, integrity is compromised by freak events (hard drive crash, server goes down, host/maintainer decides to retire, etc.) and the material is corrupted/lost before it can be replicated.

E.g. re-hosting that thing that 10,000 people still have copies of, because they noticed it went down yesterday vs. that thing that disappeared a year ago and folks are just noticing
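To make the refresh/review idea concrete, here is a minimal sketch of a periodic integrity check, assuming the collection is just a folder of files. It records a SHA-256 checksum per file and later reports which files went missing or changed. The function names (`build_manifest`, `check_manifest`) are made up for illustration and are not from any tool mentioned in this thread.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def check_manifest(folder, manifest):
    """Compare the folder against an earlier manifest.

    Returns (missing, changed): names that disappeared, and names
    whose contents no longer match the recorded checksum.
    """
    current = build_manifest(folder)
    missing = sorted(set(manifest) - set(current))
    changed = sorted(
        name for name in manifest
        if name in current and current[name] != manifest[name]
    )
    return missing, changed
```

How often something like `check_manifest` gets run is the "refresh frequency" in question: a check only helps if it happens while another good copy still exists somewhere.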


My old BBS system from the 80s still exists on a 200 MB hard drive, but I no longer have any way to read that drive. Who knows what's on my other ancient hard drives :-)

A friend of mine was able to rescue all my software for the PDP-11 from 8" floppies. I put what was mine on github.

A few years ago I tried to power up my old computers. None of them would power up. A couple made a popping sound and smoke came out. Probably all bad caps.

I did find a box of 20 or so zip drives circa 2000, and my old zip drive, which amazingly still worked. I copied everything off onto modern media.


I was just thinking about that. In my opinion, this find is sorta useless if these aren't digitized and shared publicly.

To my knowledge, digitization can be expensive, because they need hardware for high quality scans, and they have to be careful not to damage these books any further. I guess it all depends on the situation.


Apropos username. Having taken a crack at pulling relevant information out of scanned documents, I agree that scan quality is very important (though achieving it is often lengthy and expensive), especially if someone is trying to derive meaningful information from a digital copy without the physical copy to compare against.

And from the look of the picture those books are massive and probably very delicate.

EDIT: to add a bit to the expensive part of this: it's expensive even with the willingness and resources to get it done, and unfortunately even convincing someone to dedicate those resources is a hurdle.


Ah, baloney. If you can open the book, you can photograph it with your iphone. You'll find the result answers your concerns. Try it with any of your books.


That reminds me, I have an out-of-copyright book by a namesake where I took the photos years ago — before I had a smartphone let alone one with built-in OCR — and still have not gotten around to transferring the text to wiki… source? wikibooks? One of them.

I should do that.


Your phone camera, hand held, is plenty good enough to digitize each page. Even if they don't lay flat. You could pay a student to just photograph each page. The cost is minimal.

Before anyone says "this will never work! It must be done by $$$$$ professionals! It requires $$$$ equipment!" just pick a book, any book, off your bookshelf, open it up, and take a phone photo.

P.S. It works better with daylight providing enough light through the windows.


You make a good point, but there also could be more to it than that:

- Need to make sure the photographers are careful not to damage fragile pages

- Need a system of organization (syncing ten thousand default-named iphone pics with no labels is not ideal)

- You might be ignoring important differences between modern published books on your bookshelf and these materials (ex. maybe font is not same size, maybe font is not modern English, maybe characters are not printed consistently, maybe pages are dirty, all of which could impact OCR-friendliness of an iphone pic compared to something else)

- There might even be valuable information in markings below the topmost visible layer which could be revealed by scanning equipment (especially for example if pages are stuck together)

And that's just off the top of my head, without real domain knowledge.
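On the organization point, a minimal sketch of what a naming scheme could look like, assuming one folder per volume and assuming the camera's default file names sort in shooting order (which breaks if the counter wraps or changes width). The `volume_id` label and the `_p0001` page pattern are hypothetical choices, not an archival standard.

```python
from pathlib import Path

# Assumed set of photo extensions; adjust for the camera in use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".heic"}


def plan_page_names(photo_names, volume_id, start_page=1):
    """Map default camera file names to ordered, labelled page names.

    Assumes sorting the original names reproduces shooting order.
    """
    ordered = sorted(n for n in photo_names
                     if Path(n).suffix.lower() in IMAGE_SUFFIXES)
    plan = {}
    for offset, name in enumerate(ordered):
        suffix = Path(name).suffix.lower()
        plan[name] = f"{volume_id}_p{start_page + offset:04d}{suffix}"
    return plan


def rename_photos(folder, volume_id, start_page=1):
    """Apply the plan to every photo in folder and return the mapping."""
    folder = Path(folder)
    names = [p.name for p in folder.iterdir() if p.is_file()]
    plan = plan_page_names(names, volume_id, start_page)
    for old, new in plan.items():
        (folder / old).rename(folder / new)
    return plan
```

Keeping the returned mapping (old name to new name) alongside the images means the renaming can be audited or undone later.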


It's not about OCR or dirt. It's about taking an image. I doubt OCR would work on any of them, whether you use a $$$$$ archivist to photograph the pages or not.

As for below the topmost layer, you're right, an iphone camera won't do it. But worrying about that comes much, much later.


Scantailor Advanced will also help process the images into something resembling a readable scan.

But indeed, as long as you have some images you can dump them onto the Internet Archive for immediate posterity (and hope they don't go under when the lawsuit determines a penalty).



