Rabbit: A Compiler for Scheme/Chapter 1

[Page 17]

1. Introduction

The work described here is a continuation (!) of that described in [SCHEME], [Imperative], and [Declarative]. Before enumerating the points of the thesis, we summarize here each of these documents.

A. Background

In [SCHEME] we (Gerald Jay Sussman and the author) described the implementation of a dialect of LISP named SCHEME with the properties of lexical scoping and tail-recursion; this implementation is embedded within MacLISP [Moon], a version of LISP which does not have these properties. The property of lexical scoping (that a variable can be referenced only from points textually within the expression which binds it) is a consequence of the fact that all functions are closed in the "binding environment". [Moses] That is, SCHEME is a "full-funarg" LISP dialect. (Note Full-Funarg Example)

The property of tail-recursion implies that loops written in an apparently recursive form will actually be executed in an iterative fashion. Intuitively, function calls do not "push control stack"; instead, it is argument evaluation which pushes control stack.

The two properties of lexical scoping and tail-recursion are not independent. In most LISP systems [LISP1.5M] [Moon] [Teitelman], which use dynamic scoping rather than lexical, tail-recursion is impossible because function calls must push control stack in order to be able to undo the dynamic bindings after the return of the function. On the other hand, it is possible to have a lexically scoped LISP which does not tail-recurse, but it is easily seen that such an implementation only wastes storage space needlessly compared to a tail-recursing implementation. [Steele] Together, these two properties cause

[Page 18]

SCHEME to reflect lambda-calculus semantics much more closely than dynamically scoped LISP systems. SCHEME also permits the treatment of functions as full-fledged data objects; they may be passed as arguments, returned as values, made part of composite data structures, and notated as independent, unnamed ("anonymous") entities. (Contrast this with most ALGOL-like languages, in which a function can be written only by declaring it and giving it a name; imagine being able to use an integer value only by giving it a name in a declaration!) The property of lexical scoping allows this to be done in a consistent manner without the possibility of identifier conflicts (that is, SCHEME "solves the FUNARG problem" [Moses]). In [SCHEME] we also discussed the technique of "continuation-passing style", a way of writing programs in SCHEME such that no function ever returns a value.
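As a rough illustration of continuation-passing style, the following sketch writes factorial twice: once in the usual direct style, and once so that each call hands its result to an explicit continuation `k` rather than returning it to a waiting caller. Python is used here only for illustration, and the names `factorial_direct`, `factorial_cps`, and `k` are assumptions of this sketch, not taken from [SCHEME]. Python's `return` appears in the second version only to forward the call; Python does not eliminate tail calls, so the sketch shows the style, not SCHEME's run-time behaviour.

```python
def factorial_direct(n):
    """Direct style: the multiplication waits for the recursive call to return."""
    if n == 0:
        return 1
    return n * factorial_direct(n - 1)


def factorial_cps(n, k):
    """Continuation-passing style: the result is passed to k, never handed back.

    k is an anonymous function standing for "the rest of the computation".
    Every call below is in tail position; nothing is done after it.
    """
    if n == 0:
        return k(1)
    # The pending multiplication is packaged into a new continuation
    # that is closed in the environment where n is bound.
    return factorial_cps(n - 1, lambda value: k(n * value))


# The final continuation decides what happens to the answer.
results = []
factorial_cps(5, results.append)
```

Note that the intermediate quantity `value` is named explicitly in the second version, whereas in the first it exists only implicitly on the control stack.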

In [Imperative] we explored ways of exploiting these properties to implement most traditional programming constructs, such as assignment, looping, and call-by-name, in terms of function application. Such applicative (lambda-calculus) models of programming language constructs are well-known to theoreticians (see [Stoy], for example), but have not been used in a practical programming system. All of these constructs are actually made available in SCHEME by macros which expand into these applicative definitions. This technique has permitted the speedy implementation of a rich user-level language in terms of a very small, easy-to-implement basis set of primitive constructs. In [Imperative] we continued the exploration of continuation-passing style, and noted that the escape operator CATCH is easily modelled by transforming a program into this style. We also pointed out that transforming a program into this style enforces a particular order of argument evaluation, and makes all intermediate computational quantities manifest as variables.
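The remark about CATCH can be made concrete with a small sketch. Once a program is in continuation-passing style, an escape is nothing more than calling a continuation other than the current one. The example below is an assumption-laden illustration in Python, not code from [Imperative]: the function `product` and its helpers are hypothetical, and the count of multiplications is recorded only so that the effect of the escape is visible.

```python
def product(numbers):
    """Multiply a list of numbers in continuation-passing style.

    Returns (result, number_of_multiplications_performed).
    On meeting a zero, the loop escapes by calling the outermost
    continuation directly, abandoning all pending multiplications.
    """
    multiplications = []  # trace of the multiplications actually done

    def final(value):
        # Outermost continuation: the one a CATCH would have captured.
        return (value, len(multiplications))

    def loop(rest, k):
        if not rest:
            return k(1)
        head, tail = rest[0], rest[1:]
        if head == 0:
            return final(0)  # escape: the current continuation k is ignored

        def after(partial):
            multiplications.append((head, partial))
            return k(head * partial)

        return loop(tail, after)

    return loop(numbers, final)
```

With these definitions, `product([2, 3, 4])` performs three multiplications, while `product([2, 0, 4])` performs none, because the continuations holding the pending work are simply never called.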

In [Declarative] we examined more closely the issue of tail-recursion,

[Page 19]

and demonstrated that the usual view of function calls as pushing a return address must lead to an either inefficient or inconsistent implementation, while the tail-recursive approach of SCHEME leads to a uniform discipline in which function calls are treated as GOTO statements which also pass arguments. We also noted that a consequence of lexical scoping is that the only code which can reference the value of a variable in a given environment is code which is closed in that environment or which receives the value as an argument; this in turn implies that a compiler can structure a run-time environment in any arbitrary fashion, because it will compile all the code which can reference that environment, and so can arrange for that code to reference it in the appropriate manner. Such references do not require any kind of search (as is commonly and incorrectly believed in the LISP community because of early experience with LISP interpreters which search a-lists) because the compiler can determine the precise location of each variable in an environment at compile time. It is not necessary to use a standard format, because neither interpreted code nor other compiled code can refer to that environment. (This is to be contrasted with "spaghetti stacks" [Bobrow and Wegbreit].) In [Declarative] we also carried on the analysis of continuation-passing style, and noted that transforming a program into this style elucidates traditional compilation issues such as register allocation because user variables and intermediate quantities alike are made manifest as variables on an equal footing. Appendix A of [Declarative] contained an algorithm for converting any SCHEME program (not containing ASET) to continuation-passing style.
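The view of a function call as a GOTO which also passes arguments can be imitated in a language that does not itself eliminate tail calls. The sketch below is such an imitation in Python, under stated assumptions: the driver `run`, the tags `"goto"` and `"done"`, and the example functions are all hypothetical names introduced here. Each function, instead of calling its successor, reports which function to jump to next and with what arguments; a single loop performs the jumps, so the control stack never grows.

```python
def run(function, *arguments):
    """Driver loop: repeatedly 'jump' to the requested function with new arguments."""
    state = ("goto", function, arguments)
    while state[0] == "goto":
        _, target, args = state
        state = target(*args)
    return state[1]


def sum_down(n, total):
    """Sum n + (n-1) + ... + 1, written as a loop in recursive form."""
    if n == 0:
        return ("done", total)
    return ("goto", sum_down, (n - 1, total + n))


def is_even(n):
    """Mutually tail-recursive with is_odd: a jump between two 'labels'."""
    if n == 0:
        return ("done", True)
    return ("goto", is_odd, (n - 1,))


def is_odd(n):
    if n == 0:
        return ("done", False)
    return ("goto", is_even, (n - 1,))


# 100000 "calls" are made without exhausting the control stack.
big_sum = run(sum_down, 100000, 0)
```

Written with ordinary Python calls, the same functions would exceed the default recursion limit long before reaching this depth; a tail-recursive implementation such as SCHEME needs no such driver, since the jump is what a call compiles into.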

We have implemented two compilers for the language SCHEME. The purpose was to explore compilation techniques for a language modelled on lambda-calculus, using lambda-calculus-style models of imperative programming constructs. Both compilers use the strategy of converting the source program to continuation-passing style.

[Page 20]

The first compiler (known as CHEAPY) was written as a throwaway implementation to test the concept of conversion to continuation-passing style. The first half of CHEAPY is essentially the algorithm which appears in Appendix A of [Declarative], and the second is a simple code generator with almost no optimization. In conjunction with the writing of CHEAPY, the SCHEME interpreter was modified to interface to compiled functions. (This interface is described later in this report.)

The second compiler, with which we are primarily concerned here, is known as RABBIT. It, like CHEAPY, is written almost entirely in SCHEME (with minor exceptions due only to problems in interfacing with certain MacLISP I/O facilities). Unlike CHEAPY, it is fairly clever. It is intended to demonstrate a number of optimization techniques relevant to lexical environments and tail-recursive control structures. (The code for RABBIT, with commentary, appears in the Appendix.)

B. The Thesis

(1) Function calls are not expensive when compiled correctly; they should be thought of as GOTO statements that happen to pass arguments.

(2) The combination of cheap function calls, lexical scoping, tail-recursion, and "anonymous" notation of functions (which are not independent properties of a language, but aspects of a single unified approach) permits the definition of a wide variety of "imperative" constructs in applicative terms. Because these properties result from adhering to the principles of the well-known lambda-calculus [Church], such definitions can be lifted intact from existing literature and used directly.

[Page 21]

(3) A macro facility (the ability to specify syntactic transformations) makes it practical to use these as the only definitions of imperative constructs in a programming system. Such a facility makes it extremely easy to define new constructs.

(4) A few well-chosen optimization strategies enable the compilation of these applicative definitions into the imperative low-level code which one would expect from a traditional compiler.

(5) The macro facility and the optimization techniques used by the compiler can be conceptually unified. The same properties which make it easy to write the macros make it easy to define optimizations correctly. In the same way that many programming constructs are defined in terms of a small, well-chosen basis set, so a large number of traditional optimization techniques fall out as special cases of the few used in RABBIT. This is no accident. The separate treatment of a large and diverse set of constructs necessitates separate optimization techniques for each. As the basis set of constructs is reduced, so is the set of interesting transformations. If the basis set is properly chosen, their combined effect is "multiplicative" rather than "additive".

(6) The technique of compiling by converting to continuation-passing style elucidates some important compilation issues in a natural way. Intermediate quantities are made manifest; so is the precise order of evaluation. Moreover, this is all expressed in a language isomorphic to a subset of the source language SCHEME; as a result the continuation-passing style version of a program inherits many of the philosophical and practical advantages. For example, the same optimization techniques can be applied at this level as at the original source level. While the use of continuation-passing style may not make the decisions any easier, it provides an effective and

[Page 22]

natural way to express the results of those decisions.

(7) Continuation-passing style, while apparently applicative in nature, admits a peculiarly imperative interpretation as a consequence of the facts that it requires no control stack to be evaluated and that no functions ever return values. As a result, it is easily converted to an imperative machine language.

(8) A SCHEME compiler should ideally be a designer of good data structures, since it may choose any representation whatsoever for environments. RABBIT has a rudimentary design knowledge, involving primarily the preferral of registers to heap-allocated storage. However, there is room for knowledge of "bit-diddling" representations.

(9) We suggest that those who have tried to design useful UNCOLs (UNiversal Computer-Oriented Languages) [Sammet] [Coleman] have perhaps been thinking too imperatively, and worrying more about data manipulation primitives than about environment and control issues. As a result, proposed UNCOLs have been little more than generalizations of contemporary machine languages. We suggest that SCHEME makes an ideal UNCOL at two levels. The first level is the fully applicative level, to which most source-language constructs are easily reduced; the second is the continuation-passing style level, which is easily reduced to machine language. We envision building a compiler in three stages: (a) reduction of a user language to basic SCHEME, whether by macros, a parser of algebraic syntax, or some other means; (b) optimization by means of SCHEME-level source-to-source transformations, and conversion to continuation-passing style; and (c) generation of code for a particular machine. RABBIT addresses itself to the second stage. Data manipulation primitives are completely ignored at this stage, and are just passed along from input to output. These primitives, whether integer arithmetic, string

[Page 23]

concatenation and parsing, or list structure manipulators, are chosen as a function of a particular source language and a particular target machine. RABBIT deals only with fundamental environment and control issues common to most modes of algorithmic expression.

(10) While syntactic issues tend to be rather superficial, we point out that algebraic syntax tends to obscure the fundamental nature of function calling and tail-recursion by arbitrarily dividing functions into syntactic classes such as "operators" and "functions". ([Standish], for example, uses much space to exhibit each conceptually singular transformation in a multiplicity of syntactic manifestations.) The lack of an "anonymous" notation for functions in most algebraic languages, and the inability to treat functions as data objects, is a distinct disadvantage. The uniformity of LISP syntax makes these issues easier to deal with.

To the LISP community in particular we address these additional points:

(11) Lexical scoping need not be as expensive as is commonly thought. Experience with lexically-scoped interpreters is misleading; lexical scoping is not inherently slower than dynamic scoping. While some implementations may entail access through multiple levels of structure, this occurs only under circumstances (accessing of variables through multiple levels of closure) which could not even be expressed in a dynamically scoped language. Unlike deep-bound dynamic variables, compiled lexical access requires no search; unlike shallow-bound dynamic variables, lexical binding does not require that values be put in a canonical value cell. The compiler has complete discretion over the manipulation of environments and variable values. The "display" technique used in Algol implementations can be generalized to provide an efficient solution to the FUNARG problem.

(12) Lexical scoping does not necessarily make LISP programming unduly difficult.

[Page 24]

The very existence of RABBIT, a working compiler some fifty pages in length written in SCHEME, first implemented in about a month, part-time, substantiates this claim (which is, however, admitted to be mostly a matter of taste and experience). (Note Refinement of RABBIT) SCHEME has also been used to implement several AI problem-solving languages, including AMORD [Doyle].
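The claim in point (11) that compiled lexical access requires no search can be illustrated with a toy compiler. The sketch below, in Python, is only one possible arrangement and is not RABBIT's: the expression format (nested tuples tagged `"var"`, `"lambda"`, `"call"`, and `"add"`) and the names `compile_expr` and `resolve` are assumptions made for this example. Variable names are looked up once, at compile time, and replaced by a fixed (depth, index) position; the run-time environment holds values only and contains no names at all.

```python
def resolve(name, scope):
    """Compile-time lookup: find the fixed (depth, index) position of a variable.

    scope is a tuple of parameter-name tuples, innermost binding first.
    """
    for depth, frame in enumerate(scope):
        if name in frame:
            return depth, frame.index(name)
    raise NameError(name)


def compile_expr(expr, scope=()):
    """Translate an expression into a Python function of the run-time environment.

    The run-time environment is a tuple of argument tuples, innermost first.
    """
    if isinstance(expr, int):
        return lambda env: expr
    tag = expr[0]
    if tag == "var":
        depth, index = resolve(expr[1], scope)    # search happens here, once
        return lambda env: env[depth][index]      # run time: direct indexing only
    if tag == "lambda":
        _, params, body = expr
        body_code = compile_expr(body, (tuple(params),) + scope)
        # The closure captures env, the environment in which it was created.
        return lambda env: (lambda *args: body_code((args,) + env))
    if tag == "call":
        function_code = compile_expr(expr[1], scope)
        argument_codes = [compile_expr(arg, scope) for arg in expr[2:]]
        return lambda env: function_code(env)(*[code(env) for code in argument_codes])
    if tag == "add":
        left = compile_expr(expr[1], scope)
        right = compile_expr(expr[2], scope)
        return lambda env: left(env) + right(env)
    raise ValueError("unknown expression: %r" % (expr,))


# ((lambda (x) (lambda (y) (+ x y))) 3) applied to 4
make_adder = ("lambda", ["x"], ("lambda", ["y"], ("add", ("var", "x"), ("var", "y"))))
program = ("call", ("call", make_adder, 3), 4)
answer = compile_expr(program)(())
```

In this sketch the reference to `x` from the inner function passes through one extra level of structure, which is exactly the multiple-levels-of-closure case described above; a compiler with full discretion over environment layout could choose a flatter representation instead.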