How is PDF "machine hostile"? It's well documented and trivial to extract text ...

2sk21 · on Oct 20, 2020

This is far from an easy problem - I have been struggling with it for the past few years. Tools like pdftotext only work with the most simply formatted documents. In contrast, the typical PDFs generated in many business organizations are almost impossible to extract text from, especially when they contain insets, callouts and multiple columns. Amazon's Textract is one of the best PDF text extractors out there, and even it struggles with many ordinary-looking PDFs.
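
To make the multi-column problem concrete, here is a toy Python sketch. It is not how pdftotext or Textract work internally. It assumes a page is just a bag of positioned text fragments, and the coordinates, the sample text and the `split_x` threshold are all made up for illustration.

```python
# Toy model: a page is a list of (x, y, text) fragments, with y growing downward.
# Two columns: the left one starts at x=50, the right one at x=300.
fragments = [
    (50, 100, "The quick brown"),
    (300, 100, "Meanwhile, the"),
    (50, 120, "fox jumps."),
    (300, 120, "dog sleeps."),
]


def naive_order(frags):
    """Read top to bottom, then left to right, ignoring columns."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]


def column_order(frags, split_x):
    """Read the whole left column, then the right one.

    split_x is a hypothetical column boundary; finding it automatically
    on real documents is the hard part.
    """
    left = sorted((f for f in frags if f[0] < split_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= split_x), key=lambda f: f[1])
    return [text for _, _, text in left + right]


naive_text = " ".join(naive_order(fragments))
column_text = " ".join(column_order(fragments, split_x=200))

print(naive_text)   # The quick brown Meanwhile, the fox jumps. dog sleeps.
print(column_text)  # The quick brown fox jumps. Meanwhile, the dog sleeps.
```

The naive ordering interleaves the two columns. The column-aware one only works because the boundary was supplied by hand, and insets and callouts break even that assumption.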

tgflynn · on Oct 20, 2020

It seems like your complaint is that two-dimensional data is more difficult to work with than one-dimensional data, which is not a problem with the format itself.

kd0amg · on Oct 20, 2020

How about that the format strongly encourages use of 2D representation even for fundamentally 1D data?

tgflynn · on Oct 20, 2020

If the data is fundamentally 1D, then why is it arranged with insets and multiple columns?

I think the answer is that it's arranged that way because that arrangement makes it easier for humans to interpret. Because humans prefer to see data in such 2D arrangements, file formats like PDF were developed so that 2D layouts could be defined and preserved across platforms.

Hence the fundamental problem here is that we want computers to be able to read data that is designed for human readability. That is not an easy problem and it is in fact a large part of what people mean when they talk about "AI".

kd0amg · on Oct 20, 2020

I'd push that a step further and say the problem is about data that's been rendered into a different form for human readability. The plain serial text of a magazine article exists in principle and likely wasn't written with ideas like column splits and image-wrapping at the front of the author's mind. 2sk21 advocates shipping that serial form instead of something image-like. That's what's machine-hostile about PDF: it makes the least semantically significant parts easy for the machine to discern at the cost of obfuscating the really important ideas.

Though as a counterpoint, separating content from presentation is already a fairly common practice.

tgflynn · on Oct 20, 2020

I think it depends a great deal on what type of data you're talking about. If it's a magazine article, then yes, it's essentially a stream of text and the 2D formatting is probably secondary. However, if it's tabular data, then the 2D formatting is essential to its interpretation.
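
As a rough illustration of the tabular case, here is a small Python sketch. The cells, their coordinates and the `column_xs` list are hypothetical, and it assumes each cell sits exactly at its column's x position, which real documents rarely guarantee.

```python
# Hypothetical table cells as (x, y, text). The "Qty" cell for Gadget is empty.
cells = [
    (0, 0, "Item"), (100, 0, "Qty"), (200, 0, "Price"),
    (0, 20, "Widget"), (100, 20, "2"), (200, 20, "9.99"),
    (0, 40, "Gadget"), (200, 40, "4.50"),
]


def flatten(cells):
    """Serialize the cells into a 1D stream, row by row."""
    ordered = sorted(cells, key=lambda c: (c[1], c[0]))
    return " ".join(text for _, _, text in ordered)


def to_rows(cells, column_xs):
    """Rebuild rows using x positions, keeping empty cells as ''."""
    rows = {}
    for x, y, text in cells:
        row = rows.setdefault(y, [""] * len(column_xs))
        row[column_xs.index(x)] = text
    return [rows[y] for y in sorted(rows)]


flat = flatten(cells)
table = to_rows(cells, column_xs=[0, 100, 200])

print(flat)       # Item Qty Price Widget 2 9.99 Gadget 4.50
print(table[2])   # ['Gadget', '', '4.50']
```

In the flattened stream there is no way to tell whether 4.50 is a quantity or a price. Only the horizontal position carries that information.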

A lot of the text we encounter in the world does depend to a significant extent on 2D layout to facilitate its comprehension. Examples include many (perhaps most) websites, product labels, tax forms and receipts.

None of that is the fault of the PDF format, which is intended to faithfully represent 2D layouts. As you mention, if the goal is accurate labeling of content, other formats exist that are more suitable for that purpose (though I think the concept of separating content from presentation has always worked better in theory than in practice).