February 19, 2015

Source code, while a common property of programming languages, is by no means universal. Languages like Smalltalk, traditional interactive BASICs, and some Forths often have "source of truth" representations other than text files. Additionally, REPLs of all sorts often retain program state that exists only for the duration of an interactive session. What are the trade-offs, and why does text continue to be the dominant format?


A key feature of early micro-computers was an integrated BASIC that contained not only an interpreter, but also an editor. While a few micro-computer BASICs stored raw program text, the majority stored your program as some combination of text and pre-parsed tokens. This was partially due to the fact that most of the early BASICs were descended from or inspired by either Tiny BASIC or Microsoft BASIC. But it was also likely the result of a desire to save memory on resource-constrained devices.

Remarkably, this design choice had several added benefits. Programs were often "automatically" reformatted to a consistent style, as the post-parsed form often eliminated things like extra whitespace, or reintroduced it in a consistent way when the program was listed. Early refactoring tools, like line renumbering, were easier to implement when programs were stored as an in-memory linked list.
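To make the idea concrete, here is a minimal sketch of tokenized storage. It is not modeled on any particular BASIC: the keyword codes, the `tokenize`, `listing`, and `renumber` names, and the use of a dict rather than a linked list are all assumptions made for illustration. It shows the two benefits mentioned above: listing regenerates the text in one consistent style, and renumbering only has to patch the number that follows GOTO or THEN.

```python
import re

# Hypothetical keyword table; real interpreters each had their own codes.
KEYWORDS = {"PRINT": 0x80, "GOTO": 0x81, "IF": 0x82, "THEN": 0x83, "END": 0x84}
NAMES = {code: word for word, code in KEYWORDS.items()}

# A token is a string literal, a number, a word, or any single non-space character.
TOKEN_RE = re.compile(r'"[^"]*"|\d+|[A-Za-z]+|\S')


def tokenize(statement):
    """Turn statement text into keyword codes (int) and other tokens (str)."""
    tokens = []
    for piece in TOKEN_RE.findall(statement):
        code = KEYWORDS.get(piece.upper())
        tokens.append(code if code is not None else piece)
    return tokens


def listing(program):
    """Regenerate text from the stored tokens, one space between tokens."""
    lines = []
    for number in sorted(program):
        words = [NAMES[t] if isinstance(t, int) else t for t in program[number]]
        lines.append(f"{number} " + " ".join(words))
    return lines


def renumber(program, start=10, step=10):
    """Assign new line numbers and patch the operand after GOTO or THEN."""
    mapping = {old: start + i * step for i, old in enumerate(sorted(program))}
    jumps = (KEYWORDS["GOTO"], KEYWORDS["THEN"])
    result = {}
    for old, tokens in program.items():
        patched = list(tokens)
        for i in range(1, len(patched)):
            target = patched[i]
            if patched[i - 1] in jumps and isinstance(target, str) and target.isdigit():
                patched[i] = str(mapping.get(int(target), int(target)))
        result[mapping[old]] = patched
    return result


# Sloppy input: mixed case, uneven spacing, irregular line numbers.
program = {
    5: tokenize('print    "HI"'),
    7: tokenize("if X>3   then 5"),
    20: tokenize("goto 7"),
}
for line in listing(renumber(program)):
    print(line)
```

The printed listing comes back upper-cased, evenly spaced, and numbered 10, 20, 30, with the jump targets updated to match. None of that required parsing free-form text a second time.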

Smalltalk and Self apply an even more pure "sourceless" approach in which classes or prototypes are primarily represented as in-memory data structures. This has the advantage, and the drawback, of program code being stored as a memory snapshot.

Some Forths have experimented in this direction. A few experimental Forths are purely sourceless. But more often Forths have explored using pre-parsed representations. Open Firmware used a representation (FCode) in which each word was pre-parsed into a number. Color Forth stores each Forth word as a 32-bit, Huffman-encoded, color-tagged value. In both cases, however, an interpretation phase actually compiles / interprets these formats and populates the runtime Forth dictionary. Nonetheless, most quality direct / indirect threaded Forths implement a "SEE" word which can effectively decompile / disassemble word definitions at runtime, usually with a high degree of reliability.

In my own Forth explorations, I've often struggled with the tension between the value of source code and the value of having Forth's runtime state be the source of truth. Falling back to source code, be it plain text or a pre-parsed format, has the advantage of preserving the bootstrapping process in a changeable form. Image-based systems, as is typical of Smalltalk, have the odd property that the lineage of a system is often embedded in the whole system snapshot. Bootstrapping a new variant involves starting with an existing Smalltalk image and evolving it to a new desired state. Just as I try to avoid carrying with me a complex .vimrc file, I chafe at the idea of my programming environment being a weathered thing carrying its baggage along. This seems to run very much counter to the idea of starting something new.

Pre-parsed representations have the appeal of making some kinds of automatic refactoring easier to perform robustly in software. But the danger is that one needs at least basic tooling to create such a representation when bootstrapping or migrating. When BASICA gave way to GW-BASIC, their shared token format meant things went smoothly. However, the move to QBasic, which abandoned tokenized source files in favor of plain text, was jarring. The new interpreter simply did not support the old format, so a copy of the old interpreter was required in order to export programs to plain text.

A related issue with Forth is the question of source code files vs source code blocks. Forths that run closer to the bare metal, particularly ones that run without an operating system, are known to use 1KB data blocks displayed in a 64x16 area as the atomic unit of source code. This has many nice properties: a file system is not required, line and screen editors are dramatically easier to implement, and the artistic constraint of a fixed size page makes for very readable, consistent, and succinct program text. Tooling for editing or conversion is usually easy to construct. The drawback is that the format is non-standard, and potentially wasteful. Also, while all block programs can be converted losslessly to files, the reverse is not true (at least not trivially).
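The asymmetry between the two directions is easy to see in a small sketch. The function names here are hypothetical, and the sketch assumes a block is a 1024-character string padded with spaces, treated as 16 rows of 64 columns. Going from a block to lines always works. Going from lines to a block fails as soon as a line is too wide or there are too many lines.

```python
COLS, ROWS = 64, 16
BLOCK_SIZE = COLS * ROWS  # 1024 characters, one 1KB block


def block_to_lines(block):
    """Split a block into 16 rows, dropping the space padding on the right."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("a block must be exactly 1024 characters")
    return [block[i:i + COLS].rstrip(" ") for i in range(0, BLOCK_SIZE, COLS)]


def lines_to_block(lines):
    """Pack up to 16 lines of up to 64 columns into one space-padded block."""
    if len(lines) > ROWS:
        raise ValueError("too many lines for one block")
    for line in lines:
        if len(line) > COLS:
            raise ValueError("line is wider than 64 columns")
    padded = [line.ljust(COLS) for line in lines]
    padded += [" " * COLS] * (ROWS - len(lines))
    return "".join(padded)


source = [
    ": SQUARE ( n -- n*n )  DUP * ;",
    ": CUBE ( n -- n*n*n )  DUP SQUARE * ;",
]
block = lines_to_block(source)

# Block -> lines -> block reproduces the block exactly.
assert lines_to_block(block_to_lines(block)) == block
```

Passing an 80 column line to `lines_to_block` raises `ValueError`. That is the non-trivial direction: something has to decide how to split or continue the line.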

A common practice with block editors is the use of so-called shadow blocks. Blocks are handled as a pair, one for source code, the other for comments. The two are either displayed side by side, or a toggle key shifts from one to the other. This has the advantage of allowing free form comments in the shadow block. Also, printouts of source code have the nice property that the source and shadow blocks, at a reasonable font size, can sit side by side in a listing. The approach is likely also popular because a 64x16 region is a fairly tight space when comments are inline. I personally find shadow blocks unpleasant, but that seems to be a minority view.

I am frequently tempted to devise my own text editor. For my various experimental Forths, I've implemented reasonably decent block editors, usually in 4-10 blocks of code. A general text editor is obviously more daunting. What usually stops me from commencing in earnest is that contrast. A block editor, being surprisingly easy to implement to a level of quality that would make it pleasant enough to use daily, seems like the right idea. Unfortunately, relatively little of my day to day text processing is with blocks.

I'm occasionally tempted to impose a block editing style on file editing. I imagine some interchange between the two might be possible, cleverly marking line continuations in some machine decodable style. Most source code I deal with is 80 column. An 80x25 sized block might help make this work.
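One possible shape for that interchange, purely as a thought experiment: reserve the last column of each row for a continuation marker. The marker character, the 64 column width, and the `wrap` / `unwrap` names are all assumptions of this sketch, not an existing convention. Trailing spaces on the original lines are not preserved.

```python
WIDTH = 64
BODY = WIDTH - 1  # the last column is reserved for the continuation marker
MARK = "\\"       # hypothetical marker choice


def wrap(lines):
    """Split each line into fixed-width rows; mark rows that continue."""
    rows = []
    for line in lines:
        chunks = [line[i:i + BODY] for i in range(0, len(line), BODY)] or [""]
        for chunk in chunks[:-1]:
            rows.append(chunk + MARK)            # full 63 characters plus marker
        rows.append(chunks[-1].ljust(BODY) + " ")  # final row ends in a space
    return rows


def unwrap(rows):
    """Rejoin rows; a marker in the last column means 'continued on next row'."""
    lines, current = [], ""
    for row in rows:
        current += row[:BODY]
        if row[BODY] != MARK:
            lines.append(current.rstrip(" "))
            current = ""
    if current:  # dangling continuation at the end of the input
        lines.append(current.rstrip(" "))
    return lines


long_line = "x" * 80
rows = wrap([long_line, "", "short"])
assert all(len(row) == WIDTH for row in rows)
assert unwrap(rows) == [long_line, "", "short"]
```

Because the marker lives in a reserved column rather than in the text itself, decoding never has to guess whether a character is content or markup. The cost is one column of every row.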

Forth is dangerously close to being a language that could be sourceless. Its dictionary, particularly in simpler threaded Forths, is very close to isomorphic with the source code. In my mind there are two main issues with going there. First, Forth often takes advantage of redefining words so that they mean different things in different contexts. I certainly have done this in my programs. But usually I feel like something is wrong when I do it. The other problem is the dictionary format itself. While most Forth dictionaries are fairly similar, I've never really been happy with them. As Chuck Moore pointed out in a quote in Thinking Forth, the Forth dictionary is often the most complicated data structure in a Forth system. That complexity, at least in a Forth context, feels wrong.
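One way to read the redefinition problem is sketched below. This is a toy model, not the layout of any real Forth: `Entry`, `Dictionary`, and `see` are hypothetical names, and the dictionary is just a linked list searched from the newest entry backwards. References are resolved when a word is defined, so an older definition stays reachable from the words compiled against it even after its name has been reused.

```python
class Entry:
    """One dictionary entry: a name, a body, and a link to the previous entry."""

    def __init__(self, name, body, link):
        self.name = name
        self.body = body  # list of Entry references or unresolved text
        self.link = link


class Dictionary:
    def __init__(self):
        self.latest = None

    def find(self, name):
        """Search from the newest entry backwards, so redefinitions win."""
        entry = self.latest
        while entry is not None:
            if entry.name == name:
                return entry
            entry = entry.link
        return None

    def define(self, name, words):
        """Resolve each word now; unknown words (primitives, literals) stay text."""
        body = []
        for word in words:
            found = self.find(word)
            body.append(found if found is not None else word)
        self.latest = Entry(name, body, self.latest)
        return self.latest


def see(entry):
    """Naive decompiler: print each reference by name only."""
    names = [w.name if isinstance(w, Entry) else w for w in entry.body]
    return ": " + entry.name + " " + " ".join(names) + " ;"


d = Dictionary()
d.define("SIZE", ["10"])
area = d.define("AREA", ["SIZE", "DUP", "*"])
d.define("SIZE", ["20"])  # redefinition; the old entry is still linked in
volume = d.define("VOLUME", ["SIZE", "AREA", "*"])

print(see(area))    # : AREA SIZE DUP * ;
print(see(volume))  # : VOLUME SIZE AREA * ;
```

Both decompiled definitions mention SIZE, but `area.body[0]` and `volume.body[0]` are different entries. Text regenerated from the dictionary only means the same thing if it is replayed in the original order, which is exactly the kind of history a sourceless system has to carry around.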

The potential advantages of a sourceless Forth are tempting, however. "Good" Forth programs consist of lots of 1-2 line definitions. It's tempting to think that seeing the context of a given definition (in the sense of showing the definitions of the words used to define it and the words that use it) would be nice. However, the experience with blocks in particular suggests this might not pan out. Part of the aesthetic appeal of blocks is that they can be formatted in two dimensions, without consideration of carriage returns. It's unlikely that automatic methods will have quite the same flair, regardless of how cleverly contextual they are. Auto-complete is nice when searching, but likely not when reading code. Literate Programming would also seem to suggest that limiting the location of definitions based on use doesn't fit the narrative style of the human mind. It's hard to say with certainty without trying it.
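The contextual view itself is cheap to compute, which is part of why it is tempting. A rough sketch, assuming definitions are available as a mapping from a word's name to the list of words in its body (the `context` function and the sample words are made up for illustration):

```python
def context(definitions, name):
    """Return (defined words that `name` uses, defined words that use `name`)."""
    uses = list(dict.fromkeys(w for w in definitions[name] if w in definitions))
    used_by = [other for other, body in definitions.items() if name in body]
    return uses, used_by


definitions = {
    "SQUARE": ["DUP", "*"],
    "CUBE": ["DUP", "SQUARE", "*"],
    "SUM-OF-CUBES": ["CUBE", "SWAP", "CUBE", "+"],
}

uses, used_by = context(definitions, "CUBE")
print(uses)     # ['SQUARE']
print(used_by)  # ['SUM-OF-CUBES']
```

What this cannot do is decide how the three definitions should be laid out on a page, which is the part a hand-formatted block gets right.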

All of this seems to suggest that source code is here to stay. Relatively little has crept in beyond ASCII. Remarkably, we did get beyond programming languages printable with a Selectric typeball font, though the continued presence of trigraphs in C++ would seem to contradict this to some degree. I wish for more in my source format at times (particularly color or perhaps limited font control), but just as often I wish for less in the form of blocks. I think I should console myself with the fact we've somehow managed to stick with a minimalistic format like text for source code. So few things in software are minimal these days after all.

Forth, Programming Languages