This section describes a functional interface for building regular
expressions and matching them against strings.
The matching is done using the POSIX regular expression package.
Regular expressions are in the structure regexps.
A regular expression is either a character set, which matches any character in the set, or a composite expression containing one or more subexpressions. A regular expression can be matched against a string to determine success or failure, and to determine the substrings matched by particular subexpressions.
Character sets may be defined using a list of characters and strings, using a range or ranges of characters, or by using set operations on existing character sets.
(set character-or-string ...) -> char-set
(range low-char high-char) -> char-set
(ranges low-char high-char ...) -> char-set
(ascii-range low-char high-char) -> char-set
(ascii-ranges low-char high-char ...) -> char-set
Set returns a set that contains the character arguments and the characters in any string arguments. Range returns a character set that contains all characters between low-char and high-char, inclusive. Ranges returns a set that contains all characters in the given ranges. Range and ranges use the ordering induced by char->integer. Ascii-range and ascii-ranges use the ASCII ordering. It is an error for a high-char to be less than the preceding low-char in the appropriate ordering.
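As a minimal sketch of the constructors above, the following builds three character sets and uses them directly as regular expressions. The names vowels, lower and hex-digit are illustrative only, and the sketch assumes the regexps structure has already been opened.

```scheme
; Sketch only; assumes the regexps structure is open
; (at the Scheme 48 command processor: ,open regexps).

; Individual characters and the characters of a string, combined.
(define vowels (set #\a #\e "iou"))

; One inclusive range, using the char->integer ordering.
(define lower (range #\a #\z))

; Several ranges, given as consecutive low/high pairs.
(define hex-digit (ranges #\0 #\9 #\a #\f #\A #\F))

; A character set is itself a regular expression matching one character.
(exact-match? vowels "e")        ; -> #t
(exact-match? vowels "x")        ; -> #f
(exact-match? lower "q")         ; -> #t
(any-match? hex-digit "xyz7")    ; -> #t ("7" is in the set)
```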
(negate char-set) -> char-set
(intersection char-set char-set) -> char-set
(union char-set char-set) -> char-set
(subtract char-set char-set) -> char-set
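A short sketch of the four set operations follows. It assumes they have their usual set-theoretic meanings, with subtract removing the members of its second argument from its first; the names being defined are illustrative only.

```scheme
; Sketch only; assumes the regexps structure is open (,open regexps)
; and that the operations behave as ordinary set operations.

(define letter    (union (range #\a #\z) (range #\A #\Z)))
(define consonant (subtract (range #\a #\z) (set "aeiou")))
(define lower-hex (intersection (range #\a #\z) (set "0123456789abcdef")))
(define non-digit (negate (range #\0 #\9)))

(exact-match? letter "Q")      ; -> #t
(exact-match? consonant "b")   ; -> #t
(exact-match? consonant "a")   ; -> #f
(exact-match? lower-hex "c")   ; -> #t
(exact-match? lower-hex "g")   ; -> #f
(exact-match? non-digit "5")   ; -> #f
```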
The following character sets are predefined:
lower-case      (set "abcdefghijklmnopqrstuvwxyz")
upper-case      (set "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
alphabetic      (union lower-case upper-case)
numeric         (set "0123456789")
alphanumeric    (union alphabetic numeric)
punctuation     (set "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")
graphic         (union alphanumeric punctuation)
printing        (union graphic (set #\space))
control         (negate printing)
blank           (set #\space (ascii->char 9)) ; 9 is tab
whitespace      (union (set #\space) (ascii-range 9 13))
hexdigit        (set "0123456789abcdefABCDEF")
The above are taken from the default locale in POSIX. The characters in whitespace are space, tab, newline (= line feed), vertical tab, form feed, and carriage return.
(string-start) -> reg-exp
(string-end) -> reg-exp
String-start returns a regular expression that matches the beginning of the string being matched against; string-end returns one that matches the end.
(sequence reg-exp ...) -> reg-exp
(one-of reg-exp ...) -> reg-exp
Sequence matches the concatenation of its arguments; one-of matches any one of its arguments.
(text string) -> reg-exp
Text returns a regular expression that matches the characters in string, in order.
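The combinators described above can be composed as in the following sketch, which anchors an alternation at both ends of the string. The names pets and pets-anywhere are illustrative, and string-start and string-end are assumed to take no arguments.

```scheme
; Sketch only; assumes the regexps structure is open (,open regexps).

; "cat" or "dog", followed by "s", spanning the whole string.
(define pets
  (sequence (string-start)
            (one-of (text "cat") (text "dog"))
            (text "s")
            (string-end)))

; The same pattern without the anchors.
(define pets-anywhere
  (sequence (one-of (text "cat") (text "dog"))
            (text "s")))

(any-match? pets "dogs")             ; -> #t
(any-match? pets "cats")             ; -> #t
(any-match? pets "hotdogs")          ; -> #f (no match starts at the beginning)
(any-match? pets-anywhere "hotdogs") ; -> #t
```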
(repeat reg-exp) -> reg-exp
(repeat count reg-exp) -> reg-exp
(repeat min max reg-exp) -> reg-exp
Repeat returns a regular expression that matches zero or more occurrences of its reg-exp argument. With no count the result will match any number of times (reg-exp*). With a single count the returned expression will match reg-exp exactly that number of times. The final case will match from min to max repetitions, inclusive. Max may be #f, in which case there is no maximum number of matches. Count and min should be exact, non-negative integers; max should either be an exact non-negative integer or #f.
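The three forms of repeat might be used as in this sketch. It assumes that the count arguments come before the reg-exp argument; the names being defined are illustrative only.

```scheme
; Sketch only; assumes the regexps structure is open (,open regexps)
; and that counts precede the reg-exp in the argument list.

(define digit (range #\0 #\9))

(define any-digits  (repeat digit))        ; zero or more digits
(define four-digits (repeat 4 digit))      ; exactly four digits
(define two-to-four (repeat 2 4 digit))    ; two, three or four digits
(define two-or-more (repeat 2 #f digit))   ; at least two digits, no maximum

(exact-match? any-digits "31415")     ; -> #t
(exact-match? four-digits "2024")     ; -> #t
(exact-match? four-digits "202")      ; -> #f
(exact-match? two-to-four "123")      ; -> #t
(exact-match? two-to-four "12345")    ; -> #f
(exact-match? two-or-more "7")        ; -> #f
(exact-match? two-or-more "1234567")  ; -> #t
```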
(ignore-case reg-exp) -> reg-exp
(use-case reg-exp) -> reg-exp
Regular expressions are normally case-sensitive. The value returned by ignore-case is identical to its argument except that case will be ignored when matching. The value returned by use-case is protected from future applications of ignore-case. The expressions returned by use-case and ignore-case are unaffected by later uses of these procedures.
By way of example,
    (text "ab")
matches "ab" but not "aB", "Ab", or "AB", while
    (ignore-case (text "ab"))
matches "ab", "aB", "Ab", and "AB", and
    (ignore-case (sequence (text "a") (use-case (text "b"))))
matches "ab" and "Ab" but not "aB" or "AB".
A subexpression within a larger expression can be marked as a submatch. When an expression is matched against a string, the success or failure of each submatch within that expression is reported, as well as the location of the substring matched by each successful submatch.
(submatch key reg-exp) -> reg-exp
(no-submatches reg-exp) -> reg-exp
Submatch returns a regular expression that matches its argument and causes the result of matching its argument to be reported by the match procedure. Key is used to indicate the result of this particular submatch in the alist of successful submatches returned by match. Any value may be used as a key.
No-submatches returns an expression identical to its argument, except that all submatches have been elided.
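The sketch below marks two submatches and then strips them again with no-submatches. It assumes the key is the first argument to submatch; the pattern name version and the keys major and minor are illustrative only. Symbols are used as keys so that assq can find them in the alist.

```scheme
; Sketch only; assumes the regexps structure is open (,open regexps).

(define digit (range #\0 #\9))

; "v", one digit, "-", one digit, with each digit reported under its own key.
(define version
  (sequence (text "v")
            (submatch 'major digit)
            (text "-")
            (submatch 'minor digit)))

(define m (match version "release v2-7"))

(match-start m)                                    ; -> 8
(match-end m)                                      ; -> 12
(match-start (cdr (assq 'major (match-submatches m))))   ; -> 9
(match-start (cdr (assq 'minor (match-submatches m))))   ; -> 11

; With the submatches elided, the same text still matches,
; but the submatch alist is expected to have no entries.
(define plain (match (no-submatches version) "release v2-7"))
(match-submatches plain)
```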
(any-match? reg-exp string) -> boolean
(exact-match? reg-exp string) -> boolean
(match reg-exp string) -> match or #f
(match-start match) -> index
(match-end match) -> index
(match-submatches match) -> alist
Any-match? returns #t if string matches reg-exp or contains a substring that does, and #f otherwise. Exact-match? returns #t if string as a whole matches reg-exp and #f otherwise.
Match returns #f if reg-exp does not match string and a match record if it does match. A match record contains three values: the beginning and end of the substring that matched the pattern and an alist of submatch keys and corresponding match records for any submatches that also matched. Match-start returns the index of the first character in the matching substring and match-end gives the index of the first character after the matching substring. Match-submatches returns an alist of submatch keys and match records. Only the top match record returned by match has a submatch alist.
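Because match-start and match-end are a start index and an end index in the sense of substring, the matched text can be recovered as in the following sketch. The procedures matched-text and submatch-text are hypothetical helpers, not part of regexps, and the key is assumed to be the first argument to submatch.

```scheme
; Sketch only; assumes the regexps structure is open (,open regexps).

; Hypothetical helper: the text covered by a match record.
(define (matched-text m string)
  (substring string (match-start m) (match-end m)))

; Hypothetical helper: the text of the submatch stored under key,
; or #f if that submatch did not take part in the match.
(define (submatch-text m key string)
  (let ((entry (assq key (match-submatches m))))
    (and entry (matched-text (cdr entry) string))))

(define id-pattern
  (sequence (text "id=") (submatch 'digit (range #\0 #\9))))

(define input "user id=7;")
(define m (match id-pattern input))

(matched-text m input)             ; -> "id=7"
(submatch-text m 'digit input)     ; -> "7"
(submatch-text m 'missing input)   ; -> #f
```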
Matching occurs according to POSIX. The match returned is the one with the lowest starting index in string. If there is more than one such match, the longest is returned. Within that match the longest possible submatches are returned.
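The two rules above (lowest starting index first, then longest) can be illustrated with a pair of alternations. This is a sketch of what the stated rules imply, not output captured from a particular build; the names earliest and longest are illustrative.

```scheme
; Sketch only; assumes the regexps structure is open (,open regexps).

; "ab" begins at index 1 and "bcd" at index 2, so the match with the
; lowest starting index is "ab", even though "bcd" is listed first.
(define earliest (match (one-of (text "bcd") (text "ab")) "xabcd"))
(match-start earliest)   ; -> 1
(match-end earliest)     ; -> 3

; Both alternatives can start at index 1; the longer one is returned.
(define longest (match (one-of (text "a") (text "ab")) "xab"))
(match-start longest)    ; -> 1
(match-end longest)      ; -> 3
```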
All three matching procedures cache a compiled version of reg-exp. Subsequent calls with the same reg-exp will be more efficient.
The C interface to the POSIX regular expression code uses ASCII nul as an end-of-string marker. The matching procedures will ignore any characters following an embedded ASCII nul in string.
(define pattern (text "abc"))
(any-match? pattern "abc") -> #t
(any-match? pattern "abx") -> #f
(any-match? pattern "xxabcxx") -> #t
(exact-match? pattern "abc") -> #t
(exact-match? pattern "abx") -> #f
(exact-match? pattern "xxabcxx") -> #f
(match pattern "abc") -> (#{match 0 3})
(match pattern "abx") -> #f
(match pattern "xxabcxx") -> (#{match 2 5})
(let ((x (match (sequence (text "ab")
                          (submatch 'foo (text "cd"))
                          (text "ef"))
                "xxxabcdefxx")))
  (list x (match-submatches x)))
  -> (#{match 3 9} ((foo . #{match 5 7})))
(match-submatches
  (match (sequence (set "a")
                   (one-of (submatch 'foo (text "bc"))
                           (submatch 'bar (text "BC"))))
         "xxxaBCd"))
  -> ((bar . #{match 4 6}))