Scheme 48 Manual -- Regular expressions

Regular expressions

This section describes a functional interface for building regular expressions and matching them against strings. The matching is done using the POSIX regular expression package. Regular expressions are in the structure regexps.

A regular expression is either a character set, which matches any character in the set, or a composite expression containing one or more subexpressions. A regular expression can be matched against a string to determine success or failure, and to determine the substrings matched by particular subexpressions.

Character sets

Character sets may be defined using a list of characters and strings, using a range or ranges of characters, or by using set operations on existing character sets.

(set character-or-string ...) -> char-set
(range low-char high-char) -> char-set
(ranges low-char high-char ...) -> char-set
(ascii-range low-char high-char) -> char-set
(ascii-ranges low-char high-char ...) -> char-set

Set returns a set that contains the character arguments and the characters in any string arguments. Range returns a character set that contains all characters between low-char and high-char, inclusive. Ranges returns a set that contains all characters in the given ranges. Range and ranges use the ordering induced by char->integer. Ascii-range and ascii-ranges use the ASCII ordering. It is an error for a high-char to be less than the preceding low-char in the appropriate ordering.
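
As an illustration, the constructors above might be used as follows. This is a minimal sketch that assumes the regexps structure has been opened (for example with ,open regexps at the command prompt). The names vowels, sign, digit, and hex are made up for the example, and exact-match? is the matching procedure described under "Submatches and matching".

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; All names defined here are illustrative and not part of the library.
(define vowels (set "aeiou"))                     ; characters of a string
(define sign   (set #\+ #\-))                     ; individual characters
(define digit  (range #\0 #\9))                   ; one inclusive range
(define hex    (ranges #\0 #\9 #\a #\f #\A #\F))  ; several ranges at once

;; A character set is itself a regular expression that matches
;; exactly one character from the set.
(exact-match? vowels "e")   ; -> #t
(exact-match? sign "-")     ; -> #t
(exact-match? digit "7")    ; -> #t
(exact-match? digit "a")    ; -> #f
(exact-match? hex "c")      ; -> #t
```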

(negate char-set) -> char-set
(intersection char-set char-set) -> char-set
(union char-set char-set) -> char-set
(subtract char-set char-set) -> char-set

These perform the indicated operations on character sets.
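
The set operations can be sketched on small hand-built sets. The example assumes the regexps structure is open. Every name it defines (lower, vowels, consonants, and so on) is hypothetical, chosen only for illustration.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; All names defined here are illustrative only.
(define lower          (range #\a #\z))
(define vowels         (set "aeiou"))
(define consonants     (subtract lower vowels))                 ; lower minus vowels
(define not-lower      (negate lower))                          ; everything else
(define b-or-c         (intersection (set "abc") (set "bcd")))  ; common members
(define lower-or-digit (union lower (range #\0 #\9)))           ; either set

(exact-match? consonants "k")      ; -> #t
(exact-match? consonants "e")      ; -> #f
(exact-match? not-lower "K")       ; -> #t
(exact-match? b-or-c "a")          ; -> #f
(exact-match? lower-or-digit "7")  ; -> #t
```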

The following character sets are predefined:

lower-case (set "abcdefghijklmnopqrstuvwxyz")

upper-case (set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")

alphabetic (union lower-case upper-case)

numeric (set "0123456789")

alphanumeric (union alphabetic numeric)

punctuation (set "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

graphic (union alphanumeric punctuation)

printing (union graphic (set #\space))

control (negate printing)

blank (set #\space (ascii->char 9)) ; 9 is tab

whitespace (union (set #\space) (ascii-range 9 13))

hexdigit (set "0123456789abcdefABCDEF")

The above are taken from the default locale in POSIX. The characters in whitespace are space, tab, newline (= line feed), vertical tab, form feed, and carriage return.
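
The predefined sets can be used directly as regular expressions or combined with the set operations. The following sketch assumes the regexps structure is open. The name identifier-char is an assumption made for the example, not something the library defines.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; identifier-char is a made-up name for this example.
(define identifier-char (union alphanumeric (set "_")))

(exact-match? hexdigit "F")           ; -> #t
(exact-match? numeric "x")            ; -> #f
(exact-match? identifier-char "_")    ; -> #t
(any-match? whitespace "two words")   ; -> #t
(any-match? whitespace "oneword")     ; -> #f
```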

Anchoring

(string-start) -> reg-exp
(string-end) -> reg-exp

String-start returns a regular expression that matches the beginning of the string being matched against; string-end returns one that matches the end.
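
Anchors are most useful with any-match? and match, which otherwise accept a match anywhere in the string. A minimal sketch, assuming the regexps structure is open and using sequence and text from the next section. The names starts-with-abc and ends-with-abc are illustrative.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; Both pattern names are made up for this example.
(define starts-with-abc (sequence (string-start) (text "abc")))
(define ends-with-abc   (sequence (text "abc") (string-end)))

(any-match? starts-with-abc "abcxx")  ; -> #t
(any-match? starts-with-abc "xxabc")  ; -> #f
(any-match? ends-with-abc "xxabc")    ; -> #t
(any-match? ends-with-abc "abcxx")    ; -> #f
```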

Composite expressions

(sequence reg-exp ...) -> reg-exp
(one-of reg-exp ...) -> reg-exp

Sequence matches the concatenation of its arguments; one-of matches any one of its arguments.

(text string) -> reg-exp

Text returns a regular expression that matches the characters in string, in order.
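
Sequence, one-of, and text compose in the obvious way. The sketch below assumes the regexps structure is open. The name greeting is hypothetical.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; greeting is a made-up name for this example.
(define greeting
  (sequence (one-of (text "hello") (text "goodbye"))   ; either word
            (text ", ")                                ; literal separator
            (one-of (text "world") (text "moon"))))    ; either word

(exact-match? greeting "hello, moon")     ; -> #t
(exact-match? greeting "goodbye, world")  ; -> #t
(exact-match? greeting "hello world")     ; -> #f (separator missing)
```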

(repeat reg-exp) -> reg-exp
(repeat count reg-exp) -> reg-exp
(repeat min max reg-exp) -> reg-exp

Repeat returns a regular expression that matches repeated occurrences of its reg-exp argument. With no count the result will match zero or more occurrences (reg-exp*). With a single count the returned expression will match reg-exp exactly that number of times. The final case will match from min to max repetitions, inclusive. Max may be #f, in which case there is no maximum number of matches. Count and min should be exact, non-negative integers; max should either be an exact non-negative integer or #f.
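
The three forms of repeat can be compared side by side. This is a sketch that assumes the regexps structure is open. The pattern names are invented for the example.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; All pattern names are illustrative only.
(define digits      (repeat numeric))          ; zero or more digits
(define zip-code    (repeat 5 numeric))        ; exactly five digits
(define short-word  (repeat 2 4 alphabetic))   ; two to four letters
(define some-digits (repeat 1 #f numeric))     ; one or more digits

(exact-match? digits "2024")       ; -> #t
(exact-match? zip-code "02139")    ; -> #t
(exact-match? zip-code "021390")   ; -> #f (six digits)
(exact-match? short-word "abc")    ; -> #t
(exact-match? short-word "abcde")  ; -> #f (five letters)
(exact-match? some-digits "7")     ; -> #t
```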

Case sensitivity

Regular expressions are normally case-sensitive.

(ignore-case reg-exp) -> reg-exp
(use-case reg-exp) -> reg-exp

The value returned by ignore-case is identical to its argument except that case will be ignored when matching. The value returned by use-case is protected from future applications of ignore-case. The expressions returned by use-case and ignore-case are unaffected by later uses of these procedures. By way of example, the following matches "ab" but not "aB", "Ab", or "AB".

(text "ab")

while

(ignore-case (text "ab"))

matches "ab", "aB", "Ab", and "AB" and

(ignore-case (sequence (text "a")
                       (use-case (text "b"))))

matches "ab" and "Ab" but not "aB" or "AB".

Submatches and matching

A subexpression within a larger expression can be marked as a submatch. When an expression is matched against a string, the success or failure of each submatch within that expression is reported, as well as the location of the substring matched by each successful submatch.

(submatch key reg-exp) -> reg-exp
(no-submatches reg-exp) -> reg-exp

Submatch returns a regular expression that matches its argument and causes the result of matching its argument to be reported by the match procedure. Key is used to indicate the result of this particular submatch in the alist of successful submatches returned by match. Any value may be used as a key. No-submatches returns an expression identical to its argument, except that all submatches have been elided.
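
Marking submatches changes what is reported, not what is matched. The following sketch assumes the regexps structure is open and uses exact-match?, described below. The names number, point, and anonymous-point are hypothetical.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; All names defined here are illustrative only.
(define number (repeat 1 #f numeric))          ; one or more digits

(define point
  (sequence (submatch 'x number)               ; reported under key x
            (text ",")
            (submatch 'y number)))             ; reported under key y

;; Same pattern with the submatch markers removed.
(define anonymous-point (no-submatches point))

(exact-match? point "3,14")            ; -> #t
(exact-match? anonymous-point "3,14")  ; -> #t
(exact-match? anonymous-point "3;14")  ; -> #f
```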

(any-match? reg-exp string) -> boolean
(exact-match? reg-exp string) -> boolean
(match reg-exp string) -> match or #f
(match-start match) -> index
(match-end match) -> index
(match-submatches match) -> alist

Any-match? returns #t if string matches reg-exp or contains a substring that does, and #f otherwise. Exact-match? returns #t if string matches reg-exp and #f otherwise.

Match returns #f if reg-exp does not match string and a match record if it does match. A match record contains three values: the beginning and end of the substring that matched the pattern and an alist of submatch keys and corresponding match records for any submatches that also matched. Match-start returns the index of the first character in the matching substring and match-end gives the index of the first character after the matching substring. Match-submatches returns an alist of submatch keys and match records. Only the top match record returned by match has a submatch alist.
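
Because match-start and match-end are string indices, they can be passed straight to substring. The sketch below assumes the regexps structure is open. The helpers matched-text and submatch-text, and the pattern key-value, are not part of the library and are defined here only for illustration.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; key-value, matched-text and submatch-text are made-up names.
(define key-value
  (sequence (submatch 'key (repeat 1 #f alphabetic))
            (text "=")
            (submatch 'value (repeat 1 #f numeric))))

;; The part of string s covered by match record m.
(define (matched-text m s)
  (substring s (match-start m) (match-end m)))

;; The text of the submatch stored under key, or #f if it did not match.
(define (submatch-text key m s)
  (let ((entry (assq key (match-submatches m))))
    (and entry (matched-text (cdr entry) s))))

(define (parse-setting s)
  (let ((m (match key-value s)))
    (and m
         (list (matched-text m s)
               (submatch-text 'key m s)
               (submatch-text 'value m s)))))

(parse-setting "set width=80 now")  ; -> ("width=80" "width" "80")
(parse-setting "nothing here")      ; -> #f
```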

Matching occurs according to POSIX. The match returned is the one with the lowest starting index in string. If there is more than one such match, the longest is returned. Within that match the longest possible submatches are returned.
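
The leftmost-then-longest rule can be observed with overlapping alternatives. This sketch assumes the regexps structure is open and that the underlying POSIX library follows the rule as stated above. The helper match-bounds is a made-up name.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; match-bounds is an illustrative helper, not a library procedure.
(define (match-bounds reg-exp string)
  (let ((m (match reg-exp string)))
    (and m (list (match-start m) (match-end m)))))

;; Both alternatives can start at index 2; the longer one should win.
(match-bounds (one-of (text "ab") (text "abcd")) "xxabcdxx")  ; -> (2 6)

;; "bc" starts at index 1 and "cd" at index 2; the earlier start should win.
(match-bounds (one-of (text "cd") (text "bc")) "abcd")        ; -> (1 3)
```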

All three matching procedures cache a compiled version of reg-exp. Subsequent calls with the same reg-exp will be more efficient.
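
One way to benefit from the cache is to build an expression once and reuse the same object, rather than rebuilding it on every call. A minimal sketch, assuming the regexps structure is open. The names blank-line and count-blank-lines are hypothetical.

```scheme
;; Assumes the regexps structure is open, e.g. ,open regexps at the REPL.
;; blank-line and count-blank-lines are made-up names.

;; Built once at top level, so every call below passes the same reg-exp.
(define blank-line
  (sequence (string-start) (repeat blank) (string-end)))

;; Counts the strings in lines that consist only of spaces and tabs.
(define (count-blank-lines lines)
  (let loop ((lines lines) (n 0))
    (cond ((null? lines) n)
          ((any-match? blank-line (car lines))
           (loop (cdr lines) (+ n 1)))
          (else (loop (cdr lines) n)))))

(count-blank-lines '("a" "  " "b" " "))  ; -> 2
```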

The C interface to the POSIX regular expression code uses ASCII nul as an end-of-string marker. The matching procedures will ignore any characters following an embedded ASCII nul in string.

(define pattern (text "abc"))
(any-match? pattern "abc")         -> #t
(any-match? pattern "abx")         -> #f
(any-match? pattern "xxabcxx")     -> #t

(exact-match? pattern "abc")       -> #t
(exact-match? pattern "abx")       -> #f
(exact-match? pattern "xxabcxx")   -> #f

(match pattern "abc")              -> #{match 0 3}
(match pattern "abx")              -> #f
(match pattern "xxabcxx")          -> #{match 2 5}

(let ((x (match (sequence (text "ab")
                          (submatch 'foo (text "cd"))
                          (text "ef"))
                "xxxabcdefxx")))
  (list x (match-submatches x)))
  -> (#{match 3 9} ((foo . #{match 5 7})))

(match-submatches
  (match (sequence
           (set "a")
           (one-of (submatch 'foo (text "bc"))
                   (submatch 'bar (text "BC"))))
         "xxxaBCd"))
  -> ((bar . #{match 4 6}))
