
readers & tokenizers



One of the difficulties with conventional read tables is that
they become very dangerous in a "real" production software
environment, relying on lots of Scout's Honor programming
practices and usage restrictions.  If you write one 10,000-line
component of a large system and I write another, there is a risk
that you will change the semantics of (READ) out from under me,
breaking my code in ways that I will find extremely difficult
to debug.  To the extent that we are serious about having people
use Scheme outside the classroom, this is an important concern.

I have used many LISPs that lack both a user-modifiable reader and
reader "macros" -- in fact, I have written compilers using those LISPs
(including, of course, lexical and syntactic analysis) in a relatively 
painless fashion.  It merely requires good coding practices combined
with a little extra effort to build up the lexical analysis subsystem
yourself (really a fairly small number of additional functions).  I
offer this as an "existence proof" of sorts to indicate that we don't
really *need* extra gook in the reader, and to suggest that we should
be careful about how and why we make it more complex.

I have no objection to an improved input system if it is stateless.
A trivial new function that would be useful is a (READ-LINE [PORT])
procedure that reads a sequence of characters terminated by NEWLINE 
or EOF-OBJECT, whichever it encounters first, returning it as a string.  
A slight extension of this capability would require you to provide
the termination character yourself, e.g. (READ-LINE CHAR [PORT]).
This can be further extended to take a list of termination characters
instead of one character (making it similar to the scanning and breaking
functions in SNOBOL), or perhaps a predicate procedure of one argument that
returns #T iff the character it is passed is a token terminator, viz.
(READ-TOKEN TERMINATOR?-PROC [PORT]). This permits you to perform 
lexical analysis of an input stream in a quite straightforward fashion
without adding significantly to the complexity of the reader for the >90% 
of the cases where you are just reading symbols, lists, and other objects 
that the reader already handles well.
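
For concreteness, here is one way the two procedures might be written
in terms of the standard READ-CHAR, PEEK-CHAR, and EOF-OBJECT?
primitives.  This is only a sketch; whether the terminating character
is consumed or left on the port (and what to return on immediate EOF)
are details still to be settled:

    (define (read-token terminator? . maybe-port)
      (let ((port (if (null? maybe-port)
                      (current-input-port)
                      (car maybe-port))))
        (let loop ((chars '()))
          (let ((c (peek-char port)))
            (if (or (eof-object? c) (terminator? c))
                (list->string (reverse chars))  ; terminator left on port
                (begin (read-char port)
                       (loop (cons c chars))))))))

    (define (read-line . maybe-port)
      (apply read-token
             (lambda (c) (char=? c #\newline))
             maybe-port))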

The advantage I see in such an approach is that all the state is tied
up in your predicate or your specific invocation of READ-LINE.  There
is thus no risk of collisions between different modules, even if they
interleave reads on the same port.  Further, it can be implemented
quite straightforwardly.
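
For example, a module that wants comma- or whitespace-delimited
fields can carry its own predicate, with no effect on any other
module reading the same port (assuming the READ-TOKEN sketched
above):

    (define (field-terminator? c)
      (or (char=? c #\,)
          (char=? c #\space)
          (char=? c #\newline)))

    (define (read-field port)
      (read-token field-terminator? port))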

One might object that this could be "inefficient."  Such an argument
doesn't seem compelling on the face of it, first because a non-stateless
reader imposes risks that are significantly more costly than the loss
of a few cpu cycles, and second because if we take seriously the suggestion
that Scheme should be a useful systems programming language, then such
an approach will help ensure that implementors will be pressured into
improving the performance of those critical parts of the Scheme system
that usually are very slow in LISPs (i.e. readers). 

I don't know that I want to call this a "proposal," but at least I
feel that the spirit of a proposal is there, i.e. that the reader 
should be stateless if we can find an effective way to achieve that goal.

Finally, I would advocate one extension to the reader itself.
That is the inclusion of #+ and #-.  We have already added these
to the ADS Scheme reader, because we found that they had a tremendous
impact on the portability of code from one LISP environment to another.
Again, this sort of capability is critical for large development
efforts or for the production of commercial tools that are intended
to work in multiple LISP dialects.  I further suggest that "scheme"
specifically be recognized by #+ and #- and that "#+scheme" be true
in R?RS Scheme.
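
To illustrate the intent with a hypothetical fragment of source
shared between dialects (only "scheme" is proposed here as a
required feature name):

    #+scheme (define (add1 n) (+ n 1))
    #-scheme (defun add1 (n) (+ n 1))

with the usual reading: the form following #+FEATURE is seen by the
reader only when FEATURE is present, and the form following
#-FEATURE only when it is absent.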

					asc