The case for AST-based computer language representation and specification
This is the idea:
- To have a language which is defined in its AST form, rather than
its human-readable form.
- To save it as a structured AST dump rather than as readable source
code.
- To have editors and viewers that format the AST into a readable
form for human use.
(Note: AST == Abstract Syntax Tree, the internal representation
of the source code which is the output of the parser in a
compiler.)
Some advantages:
- The user can choose their preferred syntax, the spacing and brace
arrangements that seem most comfortable to them, and everyone else's
code appears to them formatted this way.
- The cruft of language development accumulates in the AST spec not
in the syntax read by the user. The user-visible syntax can be
upgraded independently from upgrades to the AST spec. For example,
the AST can be upgraded from version 4 to 5, and a new syntax
formatter written for version 5 which makes the code look just the
same to the user. In addition, all the existing code (in AST-dump
form) could be automatically upgraded to version 5 without the user
seeing any difference. Or alternatively the readable syntax could be
improved, and from the user's point of view all old code would
immediately appear to have been rewritten using the new syntax.
Compare this to current languages which jump through hoops to avoid
breaking the compilation of old source code, meaning lots of legacy
syntax, and new features bolted onto legacy syntax rather than using a
more natural or modern representation.
Some disadvantages:
- The user has less control over precise formatting, e.g. of complex
expressions -- they give up control of this to a particular code
formatter. However, with a good enough code formatter, I believe the
trade-off is worth it -- and the user can potentially write their own
(better) formatter, or tweak an existing one to improve its
formatting. I am experimenting with these ideas in Java (see below)
to see how well it could work.
- Diffs and patches appear in the AST-dump form. Whilst it would be
possible to structure the AST dump in a way that minimises diff size,
it still makes reviewing patches in raw form much harder. A viewer
would be required to convert patches into a readable form. An
alternative may be to save in a standardised readable formatted form,
and re-parse back into AST, but that gives up some of the advantages
of the AST-based approach. However, this is what I'm experimenting
with in Java right now -- using standard Java syntax as the saved
form.
One of the advantages of making the AST the primary specification
of the language is that it is half-way between the user and the tools.
So refactoring is easier because a refactoring tool doesn't have to
worry about how the refactored code should be formatted. A compiler
should be faster because parsing an AST dump is less complex.
Looking at source code from an AST point of view, a project is just
an immense tree. Taking Java as an example, the package names form a
tree with classes/enums/etc as leaves, and below those appear methods
and fields, and below those come type specifications and nested lists
of statements and expressions.
So in an abstract form we have this huge tree. We could dump it
out in folders and files, similar to how Java source is structured
right now -- which is probably the best way to interact with a version
control system (git, mercurial, etc).
Alternatively we could load it up into some kind of a database
optimised for storing trees. As a database, we could maintain indexes
for cross-referencing. This would make it very easy to do rename
refactoring -- if every reference to class A is a database pointer to
A, then we can rename A with one operation -- and similarly for local
variables and all other named entities. Reverse indexes could
optimise searches for references to a definition. This is an ideal
environment for refactoring and analysis. It could even be used for
global optimisation -- make a temporary copy of the database and
eliminate dead code or inline code globally, before it passing on to
the compiler.
Putting this into practice
As a practical test of some of these ideas, I'm working on a Java
editor which edits the AST not the source. The source is loaded up
and parsed, then the resulting AST reformatted to the user's preferred
syntax for display and editing. The edited AST is then reformatted
back into standard syntax before white space is adjusted to minimise
changes and it is saved.
Like this I will get a number of advantages:
- Uniform display of code mangled by tools or other people's weird
formatting preferences, whilst still maintaining the original
formatting in unchanged parts of the source (thus minimising
patches).
- The opportunity to experiment with other display syntaxes, or
representations optimised to what is possible with a display rather
than what is possible with a line-printer, e.g. using colour or visual
structure to indicate something rather than symbols or specific
keywords, and so on.
- The opportunity to experiment with analysis and refactoring
enabled by working with the AST.
I'm using Eclipse JDT as a base for now (org.eclipse.jdt.core.dom)
to see whether I can make this fly.
The future
I wonder whether we will ever see a language as I envisage? Having
worked through all this in my head, it is now frustrating to see
languages still being designed in terms of their ASCII representation.
It just seems so limiting. We will have to see how far I can develop
these ideas myself, and what the future may hold.
-- Peru, 19-Dec-2011
These pages and files, including applets and artwork,
are Copyright (c) 1997-2019 Jim Peters
unless otherwise stated. Please contact
me if you'd like to use anything not explicitly released, or if
you have something interesting to discuss.