[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Scheme String Operations: the Report



Here is the preliminary report...  Comments and suggestions?

----------------------------------------------------------------------

                       Scheme String Operations


Notes:

[] Strings are mutable vectors of characters.

[] The string datatype described here has very little dependence on
the character datatype from which it is built.  Thus it may correspond
to either of Common Lisp's STRING or SIMPLE-STRING datatypes.  In MIT
Scheme, the character datatype is a special extended ASCII.

[] Historically, two different methods have been used to specify
substrings: [1] a starting index and a length, and [2] a starting
index and an ending index.  Often both methods have been used for
different operations in the same data abstraction.
The arguments in favor of one method or the other seem fairly
uninteresting, but it does seem important to pick one method and use
it exclusively.  The latter method has been chosen, for no
particularly good reason.


Naming conventions:

[] Character-wise operations are normally specified in two forms, one
which operates on a substring, and another which operates on an entire
string.  The former is usually named "SUBSTRING-..." and the latter
"STRING-...".

[] Operations for which case sensitivity has meaning are specified in
two forms, a case-sensitive version and a case-insensitive version.
The latter is given the suffix "-CI" to distinguish it; if the
operation is a predicate, the suffix precedes the "?" at the end of
the name.

[] Currently, operation pairs which are "directed" will be given
similar names, such as, "STRING-FIND-NEXT-CHAR" and
"STRING-FIND-PREVIOUS-CHAR".  However, there are 3 different syllable
pairs used to distinguish these operations: "LEFT"/"RIGHT",
"PREVIOUS"/"NEXT", and "FORWARD"/"BACKWARD".  In each case the
particular pair was chosen for its meaning, but it may be better to
name them more consistently.

----------------------------------------------------------------------

                             Definitions


[] The phrases "X is a string" and "the string X" mean that X is an
object which satisfies the STRING? predicate.  Similarly, "X is a
character" means that X satisfies the CHAR? predicate.

[] The "length" of a string is the number of characters that it
contains.  This number is a non-negative integer that is determined at
the time of the string's creation.

[] The phrase "X is a valid index of Y", where Y is a string, means
that X is a non-negative integer that is strictly less than the length
of Y.  The index 0 refers to the first character, 1 to the second,
etc.

[] The phrase "<X,Y,Z> is a substring" means that:

    1.  X is a string.
    2.  Both Y and Z are non-negative integers.
    3.  Z is less than or equal to the length of X.
    4.  Y is less than or equal to Z.

This refers to that substring of X starting at Y (inclusive) and
ending at Z (exclusive).  If Y is equal to Z, the substring is empty.

[] The directions "left" and "right" refer to numerically decreasing
and numerically increasing indices, respectively.  Alternate names for
these directions are (respectively): "previous" and "next"; "backward"
and "forward".

----------------------------------------------------------------------

                      Basic Character Operations


These are the operations on characters that are required by the string
data abstraction.  The names of these operations are irrelevant; only
the functionality is required.


(CHAR-EQUAL? CHAR1 CHAR2)
(CHAR-EQUAL-CI? CHAR1 CHAR2)

These predicates are true iff CHAR1 and CHAR2 are the same character.
The former operation is case sensitive, while the latter is not.  Both
CHAR1 and CHAR2 must be characters.

(CHAR-LESS? CHAR1 CHAR2)
(CHAR-LESS-CI? CHAR1 CHAR2)

These predicates are true iff CHAR1 is strictly less than CHAR2 in the
character ordering.  The former operation is case sensitive, while the
latter is not.  Both CHAR1 and CHAR2 must be characters.


(CHAR-UPPER-CASE? CHAR)
(CHAR-LOWER-CASE? CHAR)

These predicates should be true only of upper and lower case
alphabetic characters, respectively.  CHAR must be a character.


(CHAR-UPCASE CHAR)
(CHAR-DOWNCASE CHAR)

These operations convert alphabetic characters to upper and lower
case, respectively.  If CHAR is not alphabetic, it is returned.  CHAR
must be a character.


                            Character-sets


A character-set is an abstract object that represents a (mathematical)
set of characters.  Character-set searches are most useful for parsing
strings.  The required operations on character-sets are:


(CHAR-SET-MEMBER? CHAR-SET CHAR)

A predicate which is true iff the character CHAR is a member of the
character-set CHAR-SET.

----------------------------------------------------------------------

                       Basic String Operations


These are the most primitive of the string operations.  All other
string operations can be constructed from these.


(STRING-ALLOCATE LENGTH)

This operation allocates a new string, of the given LENGTH, and
returns it.  The contents of the string are unspecified.  LENGTH must
be a non-negative integer.


(STRING? OBJECT)

This is a boolean operation which is true of all strings.

(STRING-LENGTH STRING)

This operation returns the length of STRING, which must be a string.
The value is a non-negative integer.


(STRING-REF STRING INDEX)

This operation returns the INDEX'th element of STRING, a character.
STRING must be a string, and INDEX must be a valid index of STRING.


(STRING-SET! STRING INDEX CHAR)

This operation stores the character CHAR as the INDEX'th element of
STRING.  STRING must be a string, INDEX must be a valid index of
STRING, and CHAR must be a character.

----------------------------------------------------------------------

                         Standard Operations


These operations are useful enough to deserve being "standard" in most
implementations.


(MAKE-STRING LENGTH CHAR)

This allocates a new string of the given LENGTH, and initializes all
of its elements to CHAR.  LENGTH must be a non-negative integer, and
CHAR must be a character.


(STRING-FILL! STRING CHAR)

This fills the string STRING with the character CHAR.


(STRING-NULL? STRING)

This predicate is true only of strings whose length is zero.  STRING
must be a string.


(SUBSTRING STRING START END)

This operation returns a new string, which is the substring designated
by <STRING, START, END>; this must be a valid substring designator.


(STRING-APPEND STRING1 STRING2)

This operation returns a new string, which is the concatenation of
STRING1 followed by STRING2, both of which must be strings.  This
operation may (optionally) be n-ary, for n greater than 0.


(STRING-COPY STRING)

This operation returns a new copy of STRING, which must be a string.


(STRING->LIST STRING)
(LIST->STRING CHARS)

These operations convert between strings and lists of characters.
STRING must be a string, and CHARS must be a list of characters.  The
result of either operation is a newly-created object of the
appropriate form.

----------------------------------------------------------------------

                          Motion Primitives


These operations are useful because they can be used to construct many
other string operations, e.g. SUBSTRING and STRING-APPEND.  If strings
are implemented in the usual way on conventional machines, these
operations are extremely simple to implement.  It is also possible to
in-line code them in some cases.


(SUBSTRING-FILL! STRING START END CHAR)

This fills the substring <STRING, START, END> with the character CHAR.


(SUBSTRING-MOVE-RIGHT! STRING1 START1 END1 STRING2 START2)
(SUBSTRING-MOVE-LEFT! STRING1 START1 END1 STRING2 START2)

These operations destructively copy the substring <STRING1, START1,
END1> to the string STRING2 starting at the index START2.  It must be
the case that <STRING2, START2, (+ START2 (- END1 START1))> is a
substring; this latter substring is destructively modified to contain
the contents of the former substring.

The operations differ only when the two substrings overlap, i.e. when
STRING1 and STRING2 are EQ? and the index sets of the substrings are
not disjoint.  In this case, the operations are defined to copy the
elements of the first substring serially.  SUBSTRING-MOVE-RIGHT!
copies the first substring from left to right, while
SUBSTRING-MOVE-LEFT! copies from right to left.  Thus, for example,
the two operations can be used to shift groups of characters right or
left, respectively, within a given string.

----------------------------------------------------------------------

                        Comparison Primitives


Each group of comparisons contains these operations:

    1.  Compare two substrings, case sensitive.
    2.  Compare two strings, case sensitive.
    3.  Compare two substrings, case insensitive.
    4.  Compare two strings, case insensitive.

The groups are described as a unit since they are essentially very
similar.  In all cases the arguments must be valid substrings or
strings.

(SUBSTRING-EQUAL? STRING1 START1 END1 STRING2 START2 END2)
(STRING-EQUAL? STRING1 STRING2)
(SUBSTRING-EQUAL-CI? STRING1 START1 END1 STRING2 START2 END2)
(STRING-EQUAL-CI? STRING1 STRING2)

These are boolean equality predicates, which are true only when the
two (sub)strings are the same length and contain the same characters.


(SUBSTRING-LESS? STRING1 START1 END1 STRING2 START2 END2)
(STRING-LESS? STRING1 STRING2)
(SUBSTRING-LESS-CI? STRING1 START1 END1 STRING2 START2 END2)
(STRING-LESS-CI? STRING1 STRING2)

These are boolean order predicates, which compare the (sub)strings
using the standard dictionary order; i.e. the leftmost unequal
characters determine the order, or if no such character exists, the
shorter (sub)string is less.  The character ordering is determined by
the character operations CHAR-LESS? and CHAR-LESS-CI?.


(SUBSTRING-MATCH-FORWARD STRING1 START1 END1 STRING2 START2 END2)
(STRING-MATCH-FORWARD STRING1 STRING2)
(SUBSTRING-MATCH-FORWARD-CI STRING1 START1 END1 STRING2 START2 END2)
(STRING-MATCH-FORWARD-CI STRING1 STRING2)

These operations compare the (sub)strings from left to right,
returning the number of characters that were the same.

(SUBSTRING-MATCH-BACKWARD STRING1 START1 END1 STRING2 START2 END2)
(STRING-MATCH-BACKWARD STRING1 STRING2)
(SUBSTRING-MATCH-BACKWARD-CI STRING1 START1 END1 STRING2 START2 END2)
(STRING-MATCH-BACKWARD-CI STRING1 STRING2)

These operations compare the (sub)strings from right to left,
returning the number of characters that were the same.

----------------------------------------------------------------------

                     Character Search Primitives


Each group of searches contains these operations:

    1.  Search substring for character, case sensitive.
    2.  Search string for character, case sensitive.
    3.  Search substring for character, case insensitive.
    4.  Search string for character, case insensitive.
    5.  Search substring for character in set.
    6.  Search string for character in set.

In all cases the arguments must be valid substrings, strings,
characters, or character-sets, as appropriate.


(SUBSTRING-FIND-NEXT-CHAR STRING START END CHAR)
(STRING-FIND-NEXT-CHAR STRING CHAR)
(SUBSTRING-FIND-NEXT-CHAR-CI STRING START END CHAR)
(STRING-FIND-NEXT-CHAR-CI STRING CHAR)
(SUBSTRING-FIND-NEXT-CHAR-IN-SET STRING START END CHAR-SET)
(STRING-FIND-NEXT-CHAR-IN-SET STRING CHAR-SET)

These operations return the index of the leftmost occurrence of the
character CHAR (or character-set CHAR-SET) within the given
(sub)string, or #!FALSE if there is no occurrence.


(SUBSTRING-FIND-PREVIOUS-CHAR STRING START END CHAR)
(STRING-FIND-PREVIOUS-CHAR STRING CHAR)
(SUBSTRING-FIND-PREVIOUS-CHAR-CI STRING START END CHAR)
(STRING-FIND-PREVIOUS-CHAR-CI STRING CHAR)
(SUBSTRING-FIND-PREVIOUS-CHAR-IN-SET STRING START END CHAR-SET)
(STRING-FIND-PREVIOUS-CHAR-IN-SET STRING CHAR-SET)

These operations return the index of the rightmost occurrence of the
character CHAR (or character-set CHAR-SET) within the given
(sub)string, or #!FALSE if there is no occurrence.

----------------------------------------------------------------------

                                 Case


Each of the following groups contains the following operations:

    1.  A predicate to determine a substring's case.
    2.  A predicate to determine a string's case.
    3.  A operation to destructively set a substring's case.
    4.  A operation to destructively set a string's case.

The meanings of the operations should be clear from their names.


(SUBSTRING-UPPER-CASE? STRING START END)
(STRING-UPPER-CASE? STRING)
(SUBSTRING-UPCASE! STRING START END)
(STRING-UPCASE! STRING)

(SUBSTRING-LOWER-CASE? STRING START END)
(STRING-LOWER-CASE? STRING)
(SUBSTRING-DOWNCASE! STRING START END)
(STRING-DOWNCASE! STRING)

(SUBSTRING-CAPITALIZED? STRING START END)
(STRING-CAPITALIZED? STRING)
(SUBSTRING-CAPITALIZE! STRING START END)
(STRING-CAPITALIZE! STRING)

The (sub)string in the ...CAPITALIZE... operations may not be null.

----------------------------------------------------------------------

                     Appendix: An Implementation


The following is an examplary implementation of the above string
operations (in terms of the basic operations).  This implementation is
intended to supplement the preceding specification by providing an
explicit formal description of the semantics of the operations.

There is no error checking in this code, since it was felt that it
would obscure the form somewhat.

;;;; Standard Operations

(define (make-string length char)
  (let ((result (string-allocate length)))
    (substring-fill! result 0 length char)
    result))

(define (string-fill! string char)
  (substring-fill! string 0 (string-length string) char))

(define (string-null? string)
  (zero? (string-length string)))

(define (substring string start end)
  (let ((result (string-allocate (- end start))))
    (substring-move-right! string start end result 0)
    result))

(define (string-append string1 string2)
  (let ((length1 (string-length string1))
	(length2 (string-length string2)))
    (let ((result (string-allocate (+ length1 length2))))
      (substring-move-right! string1 0 length1 result 0)
      (substring-move-right! string2 0 length2 result length1)
      result)))

(define (string-copy string)
  (let ((length (string-length string)))
    (let ((result (string-allocate length)))
      (substring-move-right! string 0 length result 0)
      result)))

(define (string->list string)
  (let ((end (string-length string)))
    (define (loop index)
      (if (= index end)
	  '()
	  (cons (string-ref string index)
		(loop (1+ index)))))
    (loop 0)))

(define (list->string chars)
  (let ((result (string-allocate (length chars))))
    (define (loop index chars)
      (if (null? chars)
	  result
	  (begin (string-set! result index (car chars))
		 (loop (1+ index) (cdr chars)))))
    (loop 0 chars)))

;;;; Motion Primitives

(define (substring-fill! string start end char)
  (define (loop index)
    (if (not (= index end))
	(begin (string-set! string index char)
	       (loop (1+ index)))))
  (loop start))

(define (substring-move-right! string1 start1 end1 string2 start2)
  (define (loop index1 index2)
    (if (not (= index1 end1))
	(begin (string-set! string2 index2 (string-ref string1 index1))
	       (loop (1+ index1) (1+ index2)))))
  (loop start1 start2))

(define (substring-move-left! string1 start1 end1 string2 start2)
  (define (loop index1 index2)
    (if (not (= index1 start1))
	(begin (string-set! string2
			    (-1+ index2)
			    (string-ref string1 (-1+ index1)))
	       (loop (-1+ index1) (-1+ index2)))))
  (loop end1 end2))

;;;; Comparison Primitives

(define substring-equal?)
(define substring-equal-ci?)

(let ()
  (define (make-substring-equal? char-equal?)
    (lambda (string1 start1 end1 string2 start2 end2)
      (define (loop index1 index2)
	(or (= index1 end1)
	    (and (char-equal? (string-ref string1 index1)
			      (string-ref string2 index2))
		 (loop (1+ index1) (1+ index2)))))
      (and (= (- end1 start1) (- end2 start2))
	   (loop start1 start2))))
  (set! substring-equal?
	(make-substring-equal? char-equal?))
  (set! substring-equal-ci?
	(make-substring-equal? char-equal-ci?)))

(define substring-less?)
(define substring-less-ci?)

(let ()
  (define (make-substring-less? char-equal? char-less?)
    (lambda (string1 start1 end1 string2 start2 end2)
      (define (loop index1 index2)
	(cond ((or (= index1 end1)
		   (= index2 end2))
	       (< (- end1 start1)
		  (- end2 start2)))
	      ((char-equal? (string-ref string1 index1)
			    (string-ref string2 index2))
	       (loop (1+ index1) (1+ index2)))
	      (else
	       (char-less? (string-ref string1 index1)
			   (string-ref string2 index2)))))
      (loop start1 start2)))
  (set! substring-less?
	(make-substring-less? char-equal? char-less?))
  (set! substring-less-ci?
	(make-substring-less? char-equal-ci? char-less-ci?)))

(define substring-match-forward)
(define substring-match-forward-ci)

(let ()
  (define (make-substring-match-forward char-equal?)
    (lambda (string1 start1 end1 string2 start2 end2)
      (define (loop index1 index2 n)
	(if (or (= index1 end1)
		(= index2 end2)
		(not (char-equal? (string-ref string1 index1)
				  (string-ref string2 index2))))
	    n
	    (loop (1+ index2) (1+ index2) (1+ n))))
      (loop start1 start2 0)))
  (set! substring-match-forward
	(make-substring-match-forward char-equal?))
  (set! substring-match-forward-ci
	(make-substring-match-forward char-equal-ci?)))

(define substring-match-backward)
(define substring-match-backward-ci)

(let ()
  (define (make-substring-match-backward char-equal?)
    (lambda (string1 start1 end1 string2 start2 end2)
      (define (loop index1 index2 n)
	(if (or (= index1 start1)
		(= index2 start2)
		(not (char-equal? (string-ref string1 (-1+ index1))
				  (string-ref string2 (-1+ index2)))))
	    n
	    (loop (-1+ index2) (-1+ index2) (1+ n))))
      (loop end1 end2 0)))
  (set! substring-match-backward
	(make-substring-match-backward char-equal?))
  (set! substring-match-backward-ci
	(make-substring-match-backward char-equal-ci?)))

(define string-equal?)
(define string-equal-ci?)
(define string-less?)
(define string-less-ci?)
(define string-match-forward)
(define string-match-forward-ci)
(define string-match-backward)
(define string-match-backward-ci)

(let ()
  (define (string-comparison substring-comparison)
    (lambda (string1 string2)
      (substring-comparison string1 0 (string-length string1)
			    string2 0 (string-length string2))))
  (set! string-equal?
	(string-comparison substring-equal?))
  (set! string-equal-ci?
	(string-comparison substring-equal-ci?))
  (set! string-less?
	(string-comparison substring-less?))
  (set! string-less-ci?
	(string-comparison substring-less-ci?))
  (set! string-match-forward
	(string-comparison substring-match-forward))
  (set! string-match-forward-ci
	(string-comparison substring-match-forward-ci))
  (set! string-match-backward
	(string-comparison substring-match-backward))
  (set! string-match-backward-ci
	(string-comparison substring-match-backward-ci)))

;;;; Character Search Primitives

(define substring-find-next-char)
(define substring-find-next-char-ci)
(define substring-find-next-char-in-set)

(let ()
  (define (find-next char-test)
    (lambda (string start end key)
      (define (loop index)
	(and (not (= index end))
	     (if (char-test key (string-ref string index))
		 index
		 (loop (1+ index)))))
      (loop start)))
  (set! substring-find-next-char (find-next char-equal?))
  (set! substring-find-next-char-ci (find-next char-equal-ci?))
  (set! substring-find-next-char-in-set (find-next char-set-member?)))

(define substring-find-previous-char)
(define substring-find-previous-char-ci)
(define substring-find-previous-char-in-set)

(let ()
  (define (find-previous char-test)
    (lambda (string start end key)
      (define (loop index)
	(and (not (= index start))
	     (let ((index (-1+ index)))
	       (if (char-test key (string-ref string index))
		   index
		   (loop index)))))
      (loop end)))
  (set! substring-find-previous-char (find-previous char-equal?))
  (set! substring-find-previous-char-ci (find-previous char-equal-ci?))
  (set! substring-find-previous-char-in-set (find-previous char-set-member?)))

(define string-find-next-char)
(define string-find-next-char-ci)
(define string-find-next-char-in-set)
(define string-find-previous-char)
(define string-find-previous-char-ci)
(define string-find-previous-char-in-set)

(let ()
  (define (string-search substring-search)
    (lambda (string char)
      (substring-search string 0 (string-length string) char)))
  (set! string-find-next-char
	(string-search substring-find-next-char))
  (set! string-find-next-char-ci
	(string-search substring-find-next-char-ci))
  (set! string-find-next-char-in-set
	(string-search substring-find-next-char-in-set))
  (set! string-find-previous-char
	(string-search substring-find-previous-char))
  (set! string-find-previous-char-ci
	(string-search substring-find-previous-char-ci))
  (set! string-find-previous-char-in-set
	(string-search substring-find-previous-char-in-set)))

;;;; Case

(define substring-upper-case?)
(define substring-lower-case?)

(let ()
  (define (substring-has-case? char-has-case?)
    (lambda (string start end)
      (define (loop index)
	(or (= index end)
	    (and (char-has-case? (string-ref string index))
		 (loop (1+ index)))))
      (loop start)))
  (set! substring-upper-case?
	(substring-has-case? char-upper-case?))
  (set! substring-lower-case?
	(substring-has-case? char-lower-case?)))

(define substring-upcase!)
(define substring-downcase!)

(let ()
  (define (substring-set-case! char-set-case)
    (lambda (string start end)
      (define (loop index)
	(if (not (= index end))
	    (begin (string-set! string
				index
				(char-set-case (string-ref string index)))
		   (loop (1+ index)))))
      (loop start)))
  (set! substring-upcase!
	(substring-set-case! char-upcase))
  (set! substring-downcase!
	(substring-set-case! char-downcase)))

(define (substring-capitalized? string start end)
  (define (loop end)
    (or (= end end)
	(and (char-lower-case? (string-ref string end))
	     (loop (1+ end)))))
  (and (not (= start end))
       (char-upper-case? (string-8b-ref string 0))
       (loop (1+ start))))

(define (substring-capitalize! string start end)
  (substring-upcase! string start (1+ start))
  (substring-downcase! string (1+ start) end))

(define string-upper-case?)
(define string-lower-case?)
(define string-upcase!)
(define string-downcase!)
(define string-capitalized?)
(define string-capitalize!)

(let ()
  (define (string-op substring-op)
    (lambda (string)
      (substring-op string 0 (string-length string))))
  (set! string-upper-case?
	(string-op substring-upper-case?))
  (set! string-lower-case?
	(string-op substring-lower-case?))
  (set! string-upcase!
	(string-op substring-upcase!))
  (set! string-downcase!
	(string-op substring-downcase!))
  (set! string-capitalized?
	(string-op substring-capitalized?))
  (set! string-capitalize!
	(string-op substring-capitalize!)))