Hacker News

They definitely did not implement PDF parsing, even a subset of it. They make some assumptions that will definitely result in incorrect parsing. For instance, they assume objects are tightly packed. They're not required to be; they should be, to save space, but it's not required. Moreover, it is possible to place objects inside other objects. It's not advised, but it's not prohibited either. As far as I can tell, this is where their PDF parsing ends: they don't parse the objects themselves (neither regular objects nor stream objects). So they've chosen PDF "because it is the most complicated format to our knowledge" but ended up just (incorrectly) chunking the stream by offset table.
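
To make the criticism concrete, here is roughly what "chunking the stream by offset table" looks like. This is my own minimal Python sketch, not the paper's code; it assumes a single classic uncompressed xref section, no incremental updates, and no xref streams, and the helper names are hypothetical:

```python
def parse_xref_offsets(data: bytes) -> list[int]:
    """Read the classic cross-reference table (via the trailing
    startxref pointer) and return byte offsets of in-use objects.
    Sketch only: assumes one uncompressed xref section and no
    incremental updates."""
    # The last startxref keyword points at the xref table.
    xref_off = int(data.rsplit(b"startxref", 1)[1].split()[0])
    lines = data[xref_off:].splitlines()
    assert lines[0].strip() == b"xref"
    offsets, i = [], 1
    while i < len(lines):
        header = lines[i].split()
        if len(header) != 2 or not header[0].isdigit():
            break  # reached the "trailer" keyword
        count = int(header[1])
        for entry in lines[i + 1 : i + 1 + count]:
            off, _gen, kind = entry.split()
            if kind == b"n":  # "n" = in-use entry, "f" = free entry
                offsets.append(int(off))
        i += 1 + count
    return sorted(offsets)

def chunk_by_offsets(data: bytes, offsets: list[int], end: int) -> list[bytes]:
    # The tight-packing assumption: each object is taken to run
    # exactly up to the next recorded offset.
    bounds = sorted(offsets) + [end]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]
```

Note that the final bound has to come from somewhere too; here the caller would pass the xref table's own offset, which is yet another packing assumption.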


> For instance, they assume objects are tightly packed. They're not required to be; they should be, to save space, but it's not required.

The PDF 2.0 spec says in section 7.5.3, "The body of a PDF file shall consist of a sequence of indirect objects representing the contents of a document." I'd read that as establishing the entire contents of the file body. Of course, real-world PDFs might have all sorts of garbage that a practical parser should be prepared for, but I don't think that it's condoned by the standard.

> Moreover, it is possible to place objects inside other objects. It's not advised but not prohibited.

I think the standard tokenization would prevent any string "obj" inside of an indirect object from actually being a keyword obj that starts a new indirect object. (And if the file body as a whole weren't tokenized from start to end, then "a sequence of indirect objects" would be nonsensical.)
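
A minimal sketch of that argument: a lexer that tracks string nesting will never treat "obj" inside a (...) literal as a keyword. This is my own hypothetical Python, handling only literal strings; real PDF lexing also covers hex strings, comments, and stream payloads:

```python
def find_obj_keywords(data: bytes) -> list[int]:
    """Return offsets where 'obj' is a real keyword token: not
    inside a (...) string literal, and delimited on both sides."""
    DELIMS = b"()<>[]{}/% \t\r\n\x00\x0c"  # PDF delimiter and whitespace bytes
    hits, i, depth = [], 0, 0
    while i < len(data):
        c = data[i:i + 1]
        if depth:  # inside a literal string
            if c == b"\\":  # escape: skip the escaped byte
                i += 2
                continue
            if c == b"(":   # literal strings nest with balanced parens
                depth += 1
            elif c == b")":
                depth -= 1
            i += 1
            continue
        if c == b"(":
            depth = 1
            i += 1
            continue
        if (data[i:i + 3] == b"obj"
                and (i == 0 or data[i - 1:i] in DELIMS)
                and (i + 3 == len(data) or data[i + 3:i + 4] in DELIMS)):
            hits.append(i)
            i += 3
            continue
        i += 1
    return hits
```

The delimiter check is also what keeps the "obj" inside "endobj" from registering as a keyword.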


An object can be placed into a stream without breaking any syntactic or semantic rules.


Yeah, this work is far from what a real PDF parser requires. It’s not uncommon for the lengths at the beginning of streams to be wrong, or for the data to be encoded in a format different from the one claimed. The offset table can also be wrong or missing.
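
For example, a defensive parser often can't trust /Length at all and falls back to scanning for the endstream keyword. A hypothetical sketch of that tactic (which itself breaks if the payload happens to contain "endstream", so real parsers layer more heuristics on top):

```python
import re

_STREAM_KW = re.compile(rb"stream\r?\n")

def extract_stream(data: bytes, start: int) -> bytes:
    """Extract a stream payload starting at the 'stream' keyword,
    ignoring the dictionary's /Length and scanning forward for the
    next 'endstream' instead. Sketch only, not the paper's code."""
    m = _STREAM_KW.match(data, start)
    if m is None:
        raise ValueError("no stream keyword at offset")
    end = data.find(b"endstream", m.end())
    if end == -1:
        raise ValueError("unterminated stream")
    # Per spec there should be an EOL before 'endstream'; strip it.
    return data[m.end():end].rstrip(b"\r\n")
```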


Malformed files are a whole other can of worms that a good parser should know how to deal with, but here the parser isn't even format-compliant.

I think they wanted to demonstrate that their work can slice a stream by an offset table, in a declarative fashion. It is a useful property. But I think they would've been better off picking OTF/TTF to demonstrate this particular feature.


Sounds to me like that's more of an issue with the PDF specification than with the work presented in the paper, in which case that's hardly the metric by which we should measure its merit.


I’m not saying PDF is a good format. I’m pointing out that they’ve made a poor choice going for PDF. There are other formats they could’ve used to demonstrate this specific technique, like OTF/TTF, which is a more traditional binary format with a whole range of approaches, including offset tables.
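
For comparison, the sfnt container used by OTF/TTF really is a clean offset-table design: a 12-byte header followed by 16-byte (tag, checksum, offset, length) records. A rough Python sketch, my own illustration rather than anything from the paper:

```python
import struct

def parse_table_directory(font: bytes) -> dict:
    """Parse the sfnt table directory at the front of an OTF/TTF
    file and map each 4-byte table tag to its (offset, length)."""
    # Header: sfntVersion, numTables, searchRange, entrySelector, rangeShift.
    _ver, num_tables, _sr, _es, _rs = struct.unpack(">IHHHH", font[:12])
    tables = {}
    for i in range(num_tables):
        rec = font[12 + 16 * i : 28 + 16 * i]
        tag, _checksum, offset, length = struct.unpack(">4sIII", rec)
        tables[tag] = (offset, length)
    return tables
```

Every table boundary is explicit, so there is no tight-packing guesswork of the kind the PDF discussion above is about.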


The abstract says "We have used IPGs to specify a number of file formats including ZIP, ELF, GIF, PE, and part of PDF". Sounds to me like they threw PDF in there to test the limits of the technique _after_ using it on a bunch of other unrelated formats.

In fact, the authors state "PDF is picked because it is the most complicated format to our knowledge, which requires some unusual parser behaviors. We did not implement a full PDF parser due to its complexity, but a functional subset to show how IPGs can support some interesting features (...) PDF is a more complicated format. Our IPG grammar for PDF does not support full PDF parsing but focuses on how some interesting features in PDF are supported. As a result, the parser generated from our IPG PDF grammar can parse simple PDF files".


Kinda sounds like you didn’t read what they actually wrote in its entirety but have still taken the first possible chance to jump in and quite aggressively tell them how what they’re doing is wrong.



