I agree, which is why I started my own series, as lmm mentioned, by showing simple ways of (ab)using gcc to figure out how to do simple/primitive code generation, and built up from that, instead of building down from lexing/parsing.
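For what it's worth, a minimal sketch of that gcc trick (the function names are my own): write the smallest C function that exercises the construct you care about, then read the assembly the compiler emits for it.

```c
/* Sketch of the "ask gcc" trick: compile with
 *   gcc -S -O1 peek.c
 * and read peek.s to see one workable instruction sequence for each
 * construct you plan to generate code for. */

int add(int a, int b) { return a + b; }             /* integer addition */

int pick(int c, int x, int y) { return c ? x : y; } /* a conditional    */

int sum_to(int n) {                                 /* a counted loop   */
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}
```

Diffing the `-O0` and `-O1` output for the same file is also instructive: the unoptimized version maps almost mechanically onto the source, which is roughly what a naive code generator produces.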
Lexing/parsing is important of course, but it's been done to death, and for most of the simple types of languages people tend to use for teaching, it's a simple problem.
Code generation, on the other hand, is still pretty poorly covered, in my opinion, and something people tend to struggle with a lot more, even if you resort to tools like LLVM (and that's fine if that's what you want, but I'd argue you should try a lower-level approach at least once to understand some of the challenges).
Exactly. I'm pursuing building one myself. My first questions? How to make a REPL, a debugger, how to implement pattern matching and type checking, whether an interpreter can be fast enough, whether it's possible to avoid building a whole VM for it, etc...
There are a LOT of practical questions left as "an exercise for the reader" that need clarification.
In the end, the parsing options can be summarized as:
- Do a lisp/forth, if you want the most minimal parsing possible
Or:
- Use a parser generator, if you don't care much about the quality of this part
Or:
- Write a top-down parser by hand, if you want some control
ANY OTHER OPTION is non-optimal and will divert you from the task, EXCEPT if you want to do something novel.
If we set the parsing stuff aside, we can put more effort into the harder and more rewarding aspects.
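To illustrate how small the hand-written top-down option can be, here's a recursive-descent parser for a toy expression grammar (the grammar and all names are mine, not from the thread; it evaluates as it parses, where a real compiler would build AST nodes in the same spots):

```c
/* Recursive-descent parser/evaluator for:
 *   expr   -> term (('+' | '-') term)*
 *   term   -> factor (('*' | '/') factor)*
 *   factor -> NUMBER | '(' expr ')'
 * One function per grammar rule; precedence falls out of the call
 * structure. */
#include <ctype.h>

static const char *src;                  /* cursor into the input      */

static void skip_ws(void) { while (isspace((unsigned char)*src)) src++; }

static long parse_expr(void);            /* forward decl for recursion */

static long parse_factor(void) {
    skip_ws();
    if (*src == '(') {                   /* '(' expr ')'               */
        src++;
        long v = parse_expr();
        skip_ws();
        if (*src == ')') src++;          /* real code: error otherwise */
        return v;
    }
    long v = 0;                          /* NUMBER                     */
    while (isdigit((unsigned char)*src))
        v = v * 10 + (*src++ - '0');
    return v;
}

static long parse_term(void) {
    long v = parse_factor();
    for (;;) {
        skip_ws();
        if      (*src == '*') { src++; v *= parse_factor(); }
        else if (*src == '/') { src++; v /= parse_factor(); }
        else return v;
    }
}

static long parse_expr(void) {
    long v = parse_term();
    for (;;) {
        skip_ws();
        if      (*src == '+') { src++; v += parse_term(); }
        else if (*src == '-') { src++; v -= parse_term(); }
        else return v;
    }
}

long eval(const char *s) { src = s; return parse_expr(); }
```

Error handling is elided, but the shape is the point: for a teaching-sized language this is a day's work, not a reason to reach for a generator.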
I found http://hokstad.com/compiler a lot easier to understand than any other compiler tutorial I've seen. Writing a compiler the way you'd write any other program. It does end up with e.g. a distinct parser, but only when the reasons that might be a good idea become apparent.
Though in retrospect I think it should have been two separate series (and I need to write a few more parts; I'm very close to having it compile itself as of last night).
I think it'd have been better to evolve the initial simple language into an equivalently simple language to parse, and to keep the long slog towards compiling Ruby as a separate thing.
Especially as that has complicated things enough to be in severe need of various refactoring (which I'm avoiding until it can compile itself, at which point I'll start cleaning it up while insisting on keeping it able to compile itself...).
The parser itself started out conceptually quite clean, for example, but the Ruby grammar is horribly complex, and I keep having to add exceptions that have turned it quite convoluted. I don't doubt it can be simplified with some effort once I have the full picture, but it's not great for teaching.
If I were going to write a programming language for myself, I would lay out the following challenge to myself:
> You are only allowed to store the AST and variable names found by parser. The input to the lexer is not allowed to be persistent.
To do this, your lexer would need to be in some sense invertible, capable of both producing a source-code-representation of an AST given some naming metadata, as well as converting that source-code-representation back to names+AST.
I think that would make the lexing + parsing task worthy of an 80% article.
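A toy sketch of what that roundtrip could look like (the types, name table, and canonical syntax are all invented for illustration): pick exactly one spelling per tree, so printing and parsing are inverses by construction.

```c
/* Toy "invertible front end": keep only an AST plus a name table, and
 * make printing and parsing exact inverses.  The canonical form is
 * fully parenthesized infix with single-letter names, so there is one
 * unambiguous spelling per tree:
 *   parse(unparse(tree)) == tree   and   unparse(parse(text)) == text */
#include <string.h>
#include <stdlib.h>

typedef struct Node {
    char op;                  /* '+', '*', or 'v' for a variable leaf */
    int name;                 /* index into the name table (leaves)   */
    struct Node *lhs, *rhs;   /* children (operators only)            */
} Node;

static const char *names[] = { "x", "y", "z" };   /* the name table   */

/* AST -> canonical source text (caller passes an empty buffer) */
static void unparse(const Node *n, char *out) {
    if (n->op == 'v') { strcat(out, names[n->name]); return; }
    strcat(out, "(");
    unparse(n->lhs, out);
    size_t len = strlen(out);
    out[len] = n->op; out[len + 1] = '\0';
    unparse(n->rhs, out);
    strcat(out, ")");
}

/* canonical source text -> AST (the inverse of unparse) */
static Node *parse(const char **s) {
    Node *n = calloc(1, sizeof *n);
    if (**s == '(') {                       /* "(" lhs op rhs ")"     */
        (*s)++;
        n->lhs = parse(s);
        n->op  = *(*s)++;
        n->rhs = parse(s);
        (*s)++;                             /* consume ')'            */
    } else {                                /* a variable leaf        */
        n->op = 'v';
        for (int i = 0; i < 3; i++)
            if (**s == names[i][0]) { n->name = i; break; }
        (*s)++;
    }
    return n;
}
```

A real language wouldn't want fully parenthesized output, of course; the interesting 80% of the article would be making the inverse hold for a human-friendly surface syntax (precedence, whitespace, comments).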
How about making your comment useful by enlightening the community as to the relative unimportance of these aspects of contemporary PL design, and suggest some other aspects for would-be designers to focus on instead? Just imagine: you could even link to resources for the latter!
I think it's just because they run out of steam after doing the "first part", and there are backends like LLVM available.
The Red Dragon book has the same issue, except they spend 800 pages on basic parsing and only end up making it sound terrifying and mathy. Definitely recommend against reading it.
Maybe a different place to start would be making your own bytecode?
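That route can start very small. A minimal stack-based bytecode interpreter might look like this (the opcodes and encoding are invented for the sketch):

```c
/* Minimal stack-machine bytecode interpreter.  A program is a flat
 * array of words: opcodes, with OP_PUSH followed by its immediate
 * operand.  run() executes until OP_HALT and returns the top of stack. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

long run(const long *code) {
    long stack[64];
    int sp = 0;                        /* next free stack slot */
    for (int pc = 0; ; ) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = code[pc++];            break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp];    break;
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp];    break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}
```

From here the compiler's job becomes "emit these words", which is a much gentler first target than real machine code, and you can grow the instruction set (loads, stores, jumps, calls) one opcode at a time.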
One problem is that syntax is just one (rather shallow) part of the design of a programming language, but it's one that gets a lot of attention because everyone who's used two programming languages can tell that it's a point where languages differ. Semantics (rules about what blobs of syntax mean) is a much more interesting way for languages to differ, but most "build a language" tutorials I (and probably GP) have seen don't seem aware that there are even decisions to be made there. The "your own" bit in the title is also a bit upsetting for a post that hands the reader a language and its implementation instead of talking about something of the reader's own design.
This fascination with "how" to build something, without considering "what" and "why", seems to be an issue that gets repeated time and time again.