Good question. Not so far. I can't quite see how to do it without getting bogged down in details. The "traditional" algorithms and grammar classes (Strong LL, Full LL, LR(0), etc.) are complicated, and yet few people use these. The ones people actually do use (LL(star), ALL(star), GLR, IELR, LALR, etc.) tend to be even more complicated.
Maybe the thing to do is to illustrate a few of the most basic algorithms by example, just to give people a taste of what they look like.
If you implement a top-down recursive descent parser, you'll quickly learn how it maps to LL. This is best done by following tutorials and by reading and rewriting simple expression parsers. There's no substitute for doing and experimenting. No short paragraph is going to give you the insight.
Build a parse tree in a recursive descent parser, and you'll notice that you build it from the top down - even if you actually construct the nodes from the bottom up, the stack frames that construct the parent nodes are already on the stack when the bottom is constructed.
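For example, a minimal recursive descent expression parser might look something like this (a sketch in Python; the grammar and names are purely illustrative). Note that expr()'s stack frame exists before any of its children are parsed, even though the tuple for each node is only created on the way back out:

    import re

    def tokenize(src):
        # crude lexer: integers and single-character operators/parens
        return re.findall(r'\d+|[()+\-*/]', src) + ['$']

    class Parser:
        def __init__(self, tokens):
            self.toks, self.pos = tokens, 0

        def peek(self):
            return self.toks[self.pos]

        def eat(self, expected=None):
            tok = self.toks[self.pos]
            if expected is not None and tok != expected:
                raise SyntaxError("expected %r, got %r" % (expected, tok))
            self.pos += 1
            return tok

        # expr : term (('+'|'-') term)*
        def expr(self):
            node = self.term()
            while self.peek() in ('+', '-'):
                op = self.eat()
                node = (op, node, self.term())
            return node

        # term : factor (('*'|'/') factor)*
        def term(self):
            node = self.factor()
            while self.peek() in ('*', '/'):
                op = self.eat()
                node = (op, node, self.factor())
            return node

        # factor : NUMBER | '(' expr ')'
        def factor(self):
            if self.peek() == '(':
                self.eat('(')
                node = self.expr()
                self.eat(')')
                return node
            return int(self.eat())

    print(Parser(tokenize("1 + 2 * (3 - 4)")).expr())
    # ('+', 1, ('*', 2, ('-', 3, 4)))

Each grammar rule becomes a function, and each alternative is chosen by peeking at the next token.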
Once you understand LL, LR can be explained by looking at it the other way around: building the parse tree from the bottom up, without any higher-up context - just states and transitions driven by the token stream. The details of the state machine are generally best left to a parser generator, though, unless you want to write your own.
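For a taste of what that machinery looks like, here is a shift/reduce driver for the toy grammar E -> E + n | n (a sketch; the ACTION/GOTO tables are hand-built here, which is exactly the part a parser generator automates):

    # grammar: rule 0: E -> E + n    rule 1: E -> n
    RULES = [('E', 3), ('E', 1)]          # (lhs, number of symbols on the rhs)
    ACTION = {
        (0, 'n'): ('shift', 2),
        (1, '+'): ('shift', 3),
        (1, '$'): ('accept', None),
        (2, '+'): ('reduce', 1), (2, '$'): ('reduce', 1),
        (3, 'n'): ('shift', 4),
        (4, '+'): ('reduce', 0), (4, '$'): ('reduce', 0),
    }
    GOTO = {(0, 'E'): 1}

    def parse(tokens):
        tokens = tokens + ['$']
        states, trees = [0], []           # state stack and parallel tree stack
        pos = 0
        while True:
            act = ACTION.get((states[-1], tokens[pos]))
            if act is None:
                raise SyntaxError("unexpected %r" % tokens[pos])
            kind, arg = act
            if kind == 'shift':
                trees.append(tokens[pos])
                states.append(arg)
                pos += 1
            elif kind == 'reduce':
                lhs, n = RULES[arg]
                children = trees[-n:]
                del trees[-n:]
                del states[-n:]
                trees.append((lhs, children))          # node built bottom-up
                states.append(GOTO[(states[-1], lhs)])
            else:                                      # accept
                return trees[0]

    print(parse(['n', '+', 'n', '+', 'n']))

Each parent node only appears after all of its children - the opposite order from the recursive descent version above.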
Once you understand the idea of LR, GLR is easy to understand: instead of a single tree, GLR parses a forest of trees, which it can thin out as it eliminates possibilities with more context. This gives GLR parsing a lot of power, at the cost of higher algorithmic complexity (how bad depends on the grammar).
And so on. The details of other algorithm variants can mostly be understood in relation to these core approaches.
> If you implement a top-down recursive descent parser, you'll quickly learn how it maps to LL.
This will indeed tell you how top-down parsers work. But this doesn't tell you very much about LL specifically.
The trickiest part of parsing is deciding what path to take. When you hand-write a recursive descent parser, your code for deciding between alternatives is ad hoc. For example, if you are parsing JSON you'll have some code that looks roughly like this (a sketch; the token kinds and helper methods are illustrative):
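    # a sketch, not real code: the token kinds and the parse_object/parse_array/...
    # helpers are assumed to exist elsewhere in the parser
    def parse_value(self):
        tok = self.peek()
        if tok == '{':
            return self.parse_object()
        elif tok == '[':
            return self.parse_array()
        elif tok == 'STRING':
            return self.parse_string()
        elif tok == 'NUMBER':
            return self.parse_number()
        elif tok in ('true', 'false'):
            return self.parse_bool()
        elif tok == 'null':
            return self.parse_null()
        else:
            raise SyntaxError("expected a JSON value, got %r" % (tok,))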
A person writing a recursive descent parser has to invent this chain of "if" tests manually and convince him/herself that it is correct. For JSON it's simple because JSON is LL(1). For other languages it can be much more complicated, and can involve multiple tokens of lookahead.
Almost all of the "secret sauce" in LL, LR, etc. is how to construct these tables of if/then in a way that is rigorous and provably correct.
It's true that understanding generally how top-down and bottom-up parsers work is a great step towards understanding parsers.
The FIRST and FOLLOW sets for LL are the first hurdle you'll hit when you try to parse something where the next production (the next path) doesn't consume the token straight away. At that point the theory becomes instantly accessible, because you now have a practical reason for it. So I respectfully disagree.
You'll quickly learn there's a mechanical transformation from LL(1) grammars to recursive descent. And if you write a recursive descent parser with one token of lookahead, no backtracking, and straightforward syntax actions, it has a mechanical translation into LL(1).
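To make that concrete, take a made-up LL(1) fragment like

    stmt : 'if' expr 'then' stmt
         | 'print' expr
         | IDENT '=' expr

The FIRST sets of the three alternatives ({'if'}, {'print'}, {IDENT}) are pairwise disjoint, so the translated function just peeks one token and picks the unique matching alternative (a sketch; the peek/eat/expr helpers are assumed):

    def stmt(self):
        tok = self.peek()
        if tok.kind == 'if':              # FIRST of alternative 1
            self.eat('if'); cond = self.expr()
            self.eat('then'); body = self.stmt()
            return ('if', cond, body)
        elif tok.kind == 'print':         # FIRST of alternative 2
            self.eat('print')
            return ('print', self.expr())
        elif tok.kind == 'IDENT':         # FIRST of alternative 3
            name = self.eat('IDENT')
            self.eat('=')
            return ('assign', name, self.expr())
        else:
            raise SyntaxError("no alternative of stmt starts with %r" % (tok,))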
Anyway, I described the journey I went through learning compiler theory before I went to college. I covered undergrad compiler course content long before I left secondary school, and understood it at a much deeper level. I got a job at Borland working on the Delphi compiler (6 years I put in) on the back of the same knowledge. It worked for me.
Sorry for being a bit pedantic about it. People sometimes write hand-written recursive descent parsers and call them LL parsers, which is sorta true at some level, but misses IMO the key ingredient of automatically-constructed parse tables. I may have been a bit over-eager in reiterating this distinction.
How did you like working on Delphi? I wish I had been studying this stuff that early!
It was educational, to say the least. I ended up working mostly remotely (when the H-1B visa came through, I didn't follow through on it - I stayed in London with my GF). Apart from learning lots of implementation tricks in compilers (IDE / code completion support especially) and how to write efficient code, I picked up experience in things that were more incidental to my immediate interests, like smart linking (it turns out to be very similar to garbage collection) and x64 disassembly for the 64-bit debugger.
More generally I learned how to tackle very complex problems under my own steam, even more so than I had as an autodidact. I have no difficulty drilling down to the next level, all the way to the CPU if necessary, if I have a problem, even if I'm starting out at the level of an interpreted language. I currently work on a web app written in a Rails / CoffeeScript combo, but I can debug MRI if we get a Ruby issue, and I have debugged MySQL when we had crashing issues.
> Out of curiosity, where would one find an explanation (potentially dry, long, and complicated) of the algos you mention here?
> In other words - how did you get to know them?
My book recommendation: Dick Grune, Ceriel J.H. Jacobs - Parsing Techniques: A Practical Guide (Second Edition). Many people will also recommend the "dragon book" (Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman - Compilers: Principles, Techniques, and Tools (Second edition)), which is not a bad book.
If you want to dive deeply into parsing and want to read about all the details over hundreds of pages, the Grune-Jacobs book is clearly the recommendation. On the other hand, if you just want to understand the theory behind the parsing stage of a compiler (as one part of a compiler - and there are lots of other parts, too), you will probably prefer the dragon book.
But actually, I did read this excellent article for the second time. However, unless you are skilled in the art, you should be using ANTLR4, Honey Badger. Terence Parr deserves a Turing award.
For parsing, most upper-division compiler classes start off with CFGs, dip briefly into LL recursive descent, and then conclude with LALR. If people remember anything, it's SHIFT/REDUCE. Unfortunately, there don't seem to be any standard tools for the LL(1) equivalent, FIRST/FOLLOW tables. And as the article shows, they aren't equivalent. Each has strengths, but LALR has the tools.
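To be fair, the tables themselves aren't hard to compute - FIRST and FOLLOW are just fixed-point iterations over the grammar. A rough sketch, for a toy expression grammar (the representation is made up; nonterminals are the dict keys, [] is an empty production):

    GRAMMAR = {
        'E':  [['T', "E'"]],
        "E'": [['+', 'T', "E'"], []],        # [] = the empty production
        'T':  [['n'], ['(', 'E', ')']],
    }
    START, EPS, EOF = 'E', 'eps', '$'

    def compute_first(g):
        first = {nt: set() for nt in g}
        changed = True
        while changed:
            changed = False
            for nt, prods in g.items():
                for prod in prods:
                    for sym in prod:
                        syms = first[sym] if sym in g else {sym}
                        before = len(first[nt])
                        first[nt] |= syms - {EPS}
                        changed |= len(first[nt]) != before
                        if EPS not in syms:
                            break
                    else:
                        # every symbol on the rhs can derive empty
                        if EPS not in first[nt]:
                            first[nt].add(EPS)
                            changed = True
        return first

    def compute_follow(g, first):
        follow = {nt: set() for nt in g}
        follow[START].add(EOF)
        changed = True
        while changed:
            changed = False
            for nt, prods in g.items():
                for prod in prods:
                    trailer = set(follow[nt])     # what can follow the rhs so far
                    for sym in reversed(prod):
                        if sym in g:
                            before = len(follow[sym])
                            follow[sym] |= trailer
                            changed |= len(follow[sym]) != before
                            if EPS in first[sym]:
                                trailer = trailer | (first[sym] - {EPS})
                            else:
                                trailer = set(first[sym] - {EPS})
                        else:
                            trailer = {sym}
        return follow

    first = compute_first(GRAMMAR)
    print(first)                          # E, T: {'n', '('};  E': {'+', 'eps'}
    print(compute_follow(GRAMMAR, first)) # E, E': {'$', ')'};  T: {'+', '$', ')'}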
Except for ANTLR. Which started out as recursive descent ...
Parsing Techniques doesn't really have any competition. The compiler books, like the compiler classes, can only give a little attention to parsing. Given how strong the available tools are and how hard the remaining subjects are (SSA, code generation, ...) maybe they have a point.
Honestly, having written a few parsers, I prefer PEG with recursive descent, or parser combinators.
I find recursive descent parsers easy to understand, write, and debug (you can drop in a breakpoint and follow it through), and the resultant code is actually readable and clean. It's also fairly easy to add error reporting and recovery. As long as you don't do complicated stuff, they're really fast too.
You don't particularly need to be an expert to write recursive descent parsers either.
There are plenty of other parser generators than ANTLR and interesting parsing techniques that fit the bill. Not that there's anything wrong with ANTLR.
The last few times I've been working on programming language prototypes, I've written the parser using Parsec parser combinators in Haskell. It's super fast and easy to use; however, it's more like syntactic sugar for recursive descent parsers than a rigorous parser generator that works for a certain grammar class.
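To illustrate the "syntactic sugar" point: stripped of the types and the nice operators, a combinator library is just functions that build little recursive descent parsers and glue them together. A rough sketch of the idea in Python (the Parsec version is analogous, with far better ergonomics and error reporting):

    # each parser is a function (src, pos) -> (value, new_pos), or None on failure
    def lit(s):
        def p(src, i):
            return (s, i + len(s)) if src.startswith(s, i) else None
        return p

    def seq(*parsers):                     # run parsers one after another
        def p(src, i):
            out = []
            for q in parsers:
                r = q(src, i)
                if r is None:
                    return None
                value, i = r
                out.append(value)
            return (out, i)
        return p

    def alt(*parsers):                     # first alternative that succeeds
        def p(src, i):
            for q in parsers:
                r = q(src, i)
                if r is not None:
                    return r
            return None
        return p

    def many(parser):                      # zero or more repetitions
        def p(src, i):
            out = []
            while True:
                r = parser(src, i)
                if r is None:
                    return (out, i)
                value, i = r
                out.append(value)
        return p

    digit  = alt(*[lit(c) for c in "0123456789"])
    number = seq(digit, many(digit))
    expr   = seq(number, many(seq(lit('+'), number)))   # a chain of additions

    print(expr("1+23+4", 0))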
I fail to see the mystery. In my opinion, state machines are the most straightforward parts of programming.
There is a learning curve with lex and yacc, but if you cannot ever use these programs, can you really call yourself a programmer?
It's the endless search for a "new" programming language and the layers upon layers of needless abstraction created by today's "programmers" that I find mystifying.
Unfortunately, given the extremely limited hardware of 1960s computers (not helped by the lack of an efficient algorithm), the parsing of an arbitrary CFG was too slow to be practical. Parsing algorithms such as LL, LR, and LALR identified subsets of the full class of CFGs that could be efficiently parsed. Later, relatively practical algorithms for parsing any CFG appeared, most notably Earley's 1973 parsing algorithm. It is easy to overlook the relative difference in performance between then and now: the fastest computer in the world from 1964-1969 was the CDC6600 which executed at around 10 MIPS; my 2010 mobile phone has a processor which runs at over 2000 MIPS. By the time computers had become fast enough for Earley's algorithm, LL, LR, and friends had established a cultural dominance which is only now being seriously challenged - many of the most widely used tools still use those algorithms (or variants) for parsing. Nevertheless in tools such as ACCENT / ENTIRE and recent versions of bison, one has access to performant parsers which can parse any CFG, if that is needed.
from: http://tratt.net/laurie/blog/entries/parsing_the_solved_prob...
Featured here 6 years ago: https://news.ycombinator.com/item?id=2327313