From string to AST: parsing (2019)

mdaniel · 7 hours ago

> quadruple G=(N,Σ,P,S), where T is a finite set of nonterminal symbols, Σ a finite set of terminal symbols, P is a set of production rules and S is a start symbol.

I think "T" is supposed to be "N" in that sentence[1], based solely upon the further use of "N" nomenclature in subsequent paragraphs

1: he said, 5 years too late into a forum just discussing the article

reply

KPGv2 · 7 hours ago

This was a really great read! I'm wrote the tree sitter grammar for the Unison programming language, and discovered I really like the work involved in pattern matching that writing tokenizers and parsers comes down to. It also gives you an in-depth understanding of how the language works that you've writing a parser for, and how tooling works.

Like if you have an AST with the ability to map onto code that is displayed in your IDE, the algorithm for an IDE to refactor a variable name is to traverse up the AST until you get to the variable's declaration and then traverse all sibling trees, changing each matching name, but stopping a traversal whenever you encounter a new binding with same name. Code folding is to identify the categories of node that are "foldable" and then you hide every child of that node. Etc. It's all tree traversal algorithms.

It gives you a deep appreciation for how powerful the tooling can be thanks to proper parsing.

reply

atan2 · 6 hours ago

It's nice to review some of this theory after a week of coding my own interpreter. I have been studying about compilers at pikuma.com the whole week and reading this article after coding a parser is a great way of reviewing what I've implemented.

reply

marxisttemp · 9 hours ago

I somewhat regularly stop to marvel that one of the greatest anarchist thinkers of our time is also responsible for foundational theories in linguistics that also correspond intimately with the foundational theories of computing. God bless Noam.

reply

revskill · 8 hours ago

Theory without examples is like traveling without a memory.

reply

ckok · 10 hours ago

I think this makes it sound a lot more difficult than it has to be, with the formal theory.

When it's really one of the most simple things if you divide it in parts and look at it from a tokenizer (string to list of tokens) and parser on top. Where the tokenizer can usually be very simple: a loop, large switch on the current character, where a choice is made on "what can this be", and making it into a formal token or error. Then a simple recursive parser that can almost be a 1 to 1 copy of the (E)BNF.

reply

jrop · 10 hours ago

I love writing parsers like this. Add in Pratt Parsing for operator precedence and writing parsers can be really easy.

reply

joz1-k · 3 hours ago

I had exactly the same feeling as you after reading the article. And interestingly, all production parsers for all major languages are hand-written recursive descent parsers. On the other hand, if you inspect the actual code for these production parsers (even for newer languages like Swift, Scala, Kotlin, or Rust), the complexity and amount of code is still quite staggering.

reply

taeric · 2 hours ago

Isn't a large portion of the code to get friendlier messages for the user?

reply

detourdog · 9 hours ago

I got the impression the author was trying to add higher level reasoning to the chosen term for string to AST parsing.

I felt that they were pointing out how the cognitive load of understanding is effected by word choice.

reply

antononcube · 9 hours ago

It seems you are describing how functional parsers (aka parser combinators) work.

(BTW, there is a "Parser combinators" section in the featured post/article.)

reply

marcosdumay · 6 hours ago

The bad news about those is that it's easy to mindlessly create a parser that runs on exponential time.

The good news is that this happens in the grammar definition. So once you define your language well, you don't have to watch for it anymore.

reply

antononcube · 4 hours ago

Insightful!

Do you know of any "large scale" research on this? I.e. analysis of multiple related projects and/or of "real life stories."

(I agree regardless.)

reply

marcosdumay · 2 hours ago

I don't know about any real-world study. But there are people complaining about it from time to time, and it's quite obvious from the theory.

reply

DemocracyFTW2 · 8 hours ago

IMHO it gets even better when you can use regular expressions and write a 'modal' parser where each mode is responsible for a certain sub-grammar, like string literals. JavaScript added the sticky flag (y) to make this even simpler.

reply

delta_p_delta_x · 10 hours ago

> All is set and done

For the record, the phrase is 'when all is said and done'[1].

[1]: https://en.wiktionary.org/wiki/when_all_is_said_and_done

reply

detourdog · 9 hours ago

I thought it was a nice pun:)

reply

KPGv2 · 7 hours ago

Yeah I was extremely triggered.

reply

antononcube · 9 hours ago

Very long write-up! If you read it and like the content, you might consider ditching Python and start using Raku a lot and often.

reply