Good question. Not so far. I can't quite see how to do it without getting bogged down in details. The "traditional" algorithms and grammar classes (Strong LL, Full LL, LR(0), etc.) are complicated, and yet few people use these. The ones people actually do use (LL(star), ALL(star), GLR, IELR, LALR, etc.) tend to be even more complicated.
Maybe the thing to do is to illustrate a few of the most basic algorithms by example, just to give people a taste of what they look like.
If you implement a top-down recursive descent parser, you'll quickly learn how it maps to LL. This is best done by following tutorials and by reading and rewriting simple expression parsers. There's no substitute for doing and experimenting. No short paragraph is going to give you the insight.
Build a parse tree in a recursive descent parser, and you'll notice that you build it from the top down - even if you actually construct the nodes from the bottom up, the stack frames that construct the parent nodes are already on the stack when the bottom is constructed.
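For example, a minimal recursive descent expression parser might look something like this (a sketch in Python; the grammar and names are purely illustrative). Note that expr()'s stack frame exists before any of its children are parsed, even though the tuple for each node is only created on the way back out:

    import re

    def tokenize(src):
        # crude lexer: integers and single-character operators/parens
        return re.findall(r'\d+|[()+\-*/]', src) + ['$']

    class Parser:
        def __init__(self, tokens):
            self.toks, self.pos = tokens, 0

        def peek(self):
            return self.toks[self.pos]

        def eat(self, expected=None):
            tok = self.toks[self.pos]
            if expected is not None and tok != expected:
                raise SyntaxError("expected %r, got %r" % (expected, tok))
            self.pos += 1
            return tok

        # expr : term (('+'|'-') term)*
        def expr(self):
            node = self.term()
            while self.peek() in ('+', '-'):
                op = self.eat()
                node = (op, node, self.term())
            return node

        # term : factor (('*'|'/') factor)*
        def term(self):
            node = self.factor()
            while self.peek() in ('*', '/'):
                op = self.eat()
                node = (op, node, self.factor())
            return node

        # factor : NUMBER | '(' expr ')'
        def factor(self):
            if self.peek() == '(':
                self.eat('(')
                node = self.expr()
                self.eat(')')
                return node
            return int(self.eat())

    print(Parser(tokenize("1 + 2 * (3 - 4)")).expr())
    # ('+', 1, ('*', 2, ('-', 3, 4)))

Each grammar rule becomes a function, and each alternative is chosen by peeking at the next token.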
Once you understand LL, LR can be explained by looking at it the other way around: building the parse tree from the bottom up, without any higher-up context - just states and transitions driven by the token stream. The details of the state machine are generally best left to a parser generator, though, unless you want to write your own.
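For a taste of what that machinery looks like, here is a shift/reduce driver for the toy grammar E -> E + n | n (a sketch; the ACTION/GOTO tables are hand-built here, which is exactly the part a parser generator automates):

    # grammar: rule 0: E -> E + n    rule 1: E -> n
    RULES = [('E', 3), ('E', 1)]          # (lhs, number of symbols on the rhs)
    ACTION = {
        (0, 'n'): ('shift', 2),
        (1, '+'): ('shift', 3),
        (1, '$'): ('accept', None),
        (2, '+'): ('reduce', 1), (2, '$'): ('reduce', 1),
        (3, 'n'): ('shift', 4),
        (4, '+'): ('reduce', 0), (4, '$'): ('reduce', 0),
    }
    GOTO = {(0, 'E'): 1}

    def parse(tokens):
        tokens = tokens + ['$']
        states, trees = [0], []           # state stack and parallel tree stack
        pos = 0
        while True:
            act = ACTION.get((states[-1], tokens[pos]))
            if act is None:
                raise SyntaxError("unexpected %r" % tokens[pos])
            kind, arg = act
            if kind == 'shift':
                trees.append(tokens[pos])
                states.append(arg)
                pos += 1
            elif kind == 'reduce':
                lhs, n = RULES[arg]
                children = trees[-n:]
                del trees[-n:]
                del states[-n:]
                trees.append((lhs, children))          # node built bottom-up
                states.append(GOTO[(states[-1], lhs)])
            else:                                      # accept
                return trees[0]

    print(parse(['n', '+', 'n', '+', 'n']))

Each parent node only appears after all of its children - the opposite order from the recursive descent version above.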
Once you understand the idea of LR, GLR is easy to understand: instead of a single tree, GLR parses a forest of trees, which it can thin out as it eliminates possibilities with more context. This gives GLR parsing a lot of power, at the cost of higher algorithmic complexity (how bad depends on the grammar).
And so on. The details of other algorithm variants can mostly be understood in relation to these core approaches.
> If you implement a top-down recursive descent parser, you'll quickly learn how it maps to LL.
This will indeed tell you how top-down parsers work. But this doesn't tell you very much about LL specifically.
The trickiest part of parsing is deciding what path to take. When you hand-write a recursive descent parser, your code for deciding between alternatives is ad hoc. For example, if you are parsing JSON you'll have some code that looks roughly like this (a sketch; the token kinds and helper methods are illustrative):
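    # a sketch, not real code: the token kinds and the parse_object/parse_array/...
    # helpers are assumed to exist elsewhere in the parser
    def parse_value(self):
        tok = self.peek()
        if tok == '{':
            return self.parse_object()
        elif tok == '[':
            return self.parse_array()
        elif tok == 'STRING':
            return self.parse_string()
        elif tok == 'NUMBER':
            return self.parse_number()
        elif tok in ('true', 'false'):
            return self.parse_bool()
        elif tok == 'null':
            return self.parse_null()
        else:
            raise SyntaxError("expected a JSON value, got %r" % (tok,))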
A person writing a recursive descent parser has to invent this chain of "if" tests manually and convince him/herself that it is correct. For JSON it's simple because JSON is LL(1). For other languages it can be much more complicated, and can involve multiple tokens of lookahead.
Almost all of the "secret sauce" in LL, LR, etc. is how to construct these tables of if/then in a way that is rigorous and provably correct.
It's true that understanding generally how top-down and bottom-up parsers work is a great step towards understanding parsers.
The FIRST and FOLLOW sets for LL are the first hurdle you'll hit when you try to parse something where the next production (the next path) doesn't consume the token straight away. At that point the theory becomes instantly accessible, because you now have a practical reason for it. So I respectfully disagree.
You'll quickly learn there's a mechanical transformation from LL(1) grammars to recursive descent. And if you write a recursive descent parser with one token of lookahead, no backtracking, and straightforward syntax actions, it has a mechanical translation into LL(1).
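To make that concrete, take a made-up LL(1) fragment like

    stmt : 'if' expr 'then' stmt
         | 'print' expr
         | IDENT '=' expr

The FIRST sets of the three alternatives ({'if'}, {'print'}, {IDENT}) are pairwise disjoint, so the translated function just peeks one token and picks the unique matching alternative (a sketch; the peek/eat/expr helpers are assumed):

    def stmt(self):
        tok = self.peek()
        if tok.kind == 'if':              # FIRST of alternative 1
            self.eat('if'); cond = self.expr()
            self.eat('then'); body = self.stmt()
            return ('if', cond, body)
        elif tok.kind == 'print':         # FIRST of alternative 2
            self.eat('print')
            return ('print', self.expr())
        elif tok.kind == 'IDENT':         # FIRST of alternative 3
            name = self.eat('IDENT')
            self.eat('=')
            return ('assign', name, self.expr())
        else:
            raise SyntaxError("no alternative of stmt starts with %r" % (tok,))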
Anyway, I described the journey I went through learning compiler theory before I went to college. I covered undergrad compiler course content long before I left secondary school, and understood it at a much deeper level. I got a job at Borland working on the Delphi compiler (6 years I put in) on the back of the same knowledge. It worked for me.
Sorry for being a bit pedantic about it. People sometimes write hand-written recursive descent parsers and call them LL parsers, which is sorta true at some level, but misses IMO the key ingredient of automatically-constructed parse tables. I may have been a bit over-eager in reiterating this distinction.
How did you like working on Delphi? I wish I had been studying this stuff that early!
It was educational, to say the least. I ended up working mostly remotely (when the H-1B visa came through, I didn't follow through on it - I stayed in London with my GF). Apart from learning lots of implementation tricks in compilers (IDE / code completion support especially) and how to write efficient code, I picked up experience in things that were more incidental to my immediate interests, like smart linking (it turns out to be very similar to garbage collection) and x64 disassembly for the 64-bit debugger.
More generally I learned how to tackle very complex problems under my own steam, even more so than I had as an autodidact. I have no difficulty drilling down to the next level, all the way to the CPU if necessary, if I have a problem, even if I'm starting out at the level of an interpreted language. I currently work on a web app written in a Rails / CoffeeScript combo, but I can debug MRI if we get a Ruby issue, and I have debugged MySQL when we had crashing issues.
> Out of curiosity, where would one find an explanation (potentially dry, long, and complicated) of the algos you mention here?
> In other words - how did you get to know them?
My book recommendation: Dick Grune, Ceriel J.H. Jacobs - Parsing Techniques: A Practical Guide (Second Edition). Many people will also recommend the "dragon book" (Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman - Compilers: Principles, Techniques, and Tools (Second edition)), which is not a bad book.
If you want to dive deeply into parsing and want to read about all the details over hundreds of pages, the Grune-Jacobs book is clearly the recommendation. On the other hand, if you just want to understand the theory behind the parsing stage of a compiler (as one part of a compiler - and there are lots of other parts, too), you will probably prefer the dragon book.
But actually, I did read this excellent article for the second time. However, unless you are skilled in the art, you should be using ANTLR4, Honey Badger. Terence Parr deserves a Turing award.
For parsing, most upper-division compiler classes start off with CFGs, dip briefly into LL recursive descent, and then conclude with LALR. If people remember anything, it's SHIFT/REDUCE. Unfortunately, there don't seem to be any standard tools for the LL(1) equivalent, FIRST/FOLLOW tables. And as the article shows, they aren't equivalent. Each has strengths, but LALR has the tools.
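To be fair, the tables themselves aren't hard to compute - FIRST and FOLLOW are just fixed-point iterations over the grammar. A rough sketch, for a toy expression grammar (the representation is made up; nonterminals are the dict keys, [] is an empty production):

    GRAMMAR = {
        'E':  [['T', "E'"]],
        "E'": [['+', 'T', "E'"], []],        # [] = the empty production
        'T':  [['n'], ['(', 'E', ')']],
    }
    START, EPS, EOF = 'E', 'eps', '$'

    def compute_first(g):
        first = {nt: set() for nt in g}
        changed = True
        while changed:
            changed = False
            for nt, prods in g.items():
                for prod in prods:
                    for sym in prod:
                        syms = first[sym] if sym in g else {sym}
                        before = len(first[nt])
                        first[nt] |= syms - {EPS}
                        changed |= len(first[nt]) != before
                        if EPS not in syms:
                            break
                    else:
                        # every symbol on the rhs can derive empty
                        if EPS not in first[nt]:
                            first[nt].add(EPS)
                            changed = True
        return first

    def compute_follow(g, first):
        follow = {nt: set() for nt in g}
        follow[START].add(EOF)
        changed = True
        while changed:
            changed = False
            for nt, prods in g.items():
                for prod in prods:
                    trailer = set(follow[nt])     # what can follow the rhs so far
                    for sym in reversed(prod):
                        if sym in g:
                            before = len(follow[sym])
                            follow[sym] |= trailer
                            changed |= len(follow[sym]) != before
                            if EPS in first[sym]:
                                trailer = trailer | (first[sym] - {EPS})
                            else:
                                trailer = set(first[sym] - {EPS})
                        else:
                            trailer = {sym}
        return follow

    first = compute_first(GRAMMAR)
    print(first)                          # E, T: {'n', '('};  E': {'+', 'eps'}
    print(compute_follow(GRAMMAR, first)) # E, E': {'$', ')'};  T: {'+', '$', ')'}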
Except for ANTLR. Which started out as recursive descent ...
Parsing Techniques doesn't really have any competition. The compiler books, like the compiler classes, can only give a little attention to parsing. Given how strong the available tools are and how hard the remaining subjects are (SSA, code generation, ...) maybe they have a point.
Honestly, having written a few parsers, I prefer PEG with recursive descent, or parser combinators.
I find recursive descent parsers easy to understand, write, and debug (you can drop in a breakpoint and follow it through), and the resultant code is actually readable and clean. It's also fairly easy to add error reporting and recovery. As long as you don't do complicated stuff, they're really fast too.
You don't particularly need to be an expert to write recursive descent parsers either.
There are plenty of other parser generators than ANTLR and interesting parsing techniques that fit the bill. Not that there's anything wrong with ANTLR.
The last few times I've been working on programming language prototypes, I've written the parser using Parsec parser combinators in Haskell. It's super fast and easy to use; however, it's more like syntactic sugar for recursive descent parsers than a rigorous parser generator that works for a certain grammar class.
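To illustrate the "syntactic sugar" point: stripped of the types and the nice operators, a combinator library is just functions that build little recursive descent parsers and glue them together. A rough sketch of the idea in Python (the Parsec version is analogous, with far better ergonomics and error reporting):

    # each parser is a function (src, pos) -> (value, new_pos), or None on failure
    def lit(s):
        def p(src, i):
            return (s, i + len(s)) if src.startswith(s, i) else None
        return p

    def seq(*parsers):                     # run parsers one after another
        def p(src, i):
            out = []
            for q in parsers:
                r = q(src, i)
                if r is None:
                    return None
                value, i = r
                out.append(value)
            return (out, i)
        return p

    def alt(*parsers):                     # first alternative that succeeds
        def p(src, i):
            for q in parsers:
                r = q(src, i)
                if r is not None:
                    return r
            return None
        return p

    def many(parser):                      # zero or more repetitions
        def p(src, i):
            out = []
            while True:
                r = parser(src, i)
                if r is None:
                    return (out, i)
                value, i = r
                out.append(value)
        return p

    digit  = alt(*[lit(c) for c in "0123456789"])
    number = seq(digit, many(digit))
    expr   = seq(number, many(seq(lit('+'), number)))   # a chain of additions

    print(expr("1+23+4", 0))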
I fail to see the mystery. In my opinion, state machines are the most straightforward parts of programming.
There is a learning curve with lex and yacc, but if you cannot ever use these programs, can you really call yourself a programmer?
It's the endless search for a "new" programming language and the layers upon layers of needless abstraction created by today's "programmers" that I find mystifying.
Unfortunately, given the extremely limited hardware of 1960s computers (not helped by the lack of an efficient algorithm), the parsing of an arbitrary CFG was too slow to be practical. Parsing algorithms such as LL, LR, and LALR identified subsets of the full class of CFGs that could be efficiently parsed. Later, relatively practical algorithms for parsing any CFG appeared, most notably Earley's 1973 parsing algorithm. It is easy to overlook the relative difference in performance between then and now: the fastest computer in the world from 1964-1969 was the CDC6600 which executed at around 10 MIPS; my 2010 mobile phone has a processor which runs at over 2000 MIPS. By the time computers had become fast enough for Earley's algorithm, LL, LR, and friends had established a cultural dominance which is only now being seriously challenged - many of the most widely used tools still use those algorithms (or variants) for parsing. Nevertheless in tools such as ACCENT / ENTIRE and recent versions of bison, one has access to performant parsers which can parse any CFG, if that is needed.
from: http://tratt.net/laurie/blog/entries/parsing_the_solved_prob...
Featured here 6 years ago: https://news.ycombinator.com/item?id=2327313