Unicode
Scheme 48 fully supports ISO 10646 (Unicode): Scheme characters represent Unicode scalar values, and Scheme strings are arrays of scalar values. More information on Unicode can be found at the Unicode web site.
6.1 Characters and their codes
Scheme 48 internally represents characters as Unicode scalar values. The unicode structure contains procedures for converting between characters and scalar values:
(char->scalar-value char) -> integer
(scalar-value->char integer) -> char
Char->scalar-value returns the scalar value of a character, and scalar-value->char converts in the other direction. Scalar-value->char signals an error if passed an integer that is not a scalar value.
Note that the Unicode scalar value range is [0, #xD7FF] ∪ [#xE000, #x10FFFF]. In particular, this excludes the surrogates, which UTF-16 uses in pairs of 16-bit code units to encode scalar values above #xFFFF. This representation differs from that of Java, which uses UTF-16 code units as the character representation -- Scheme 48 effectively uses UTF-32, and is thus in line with other Scheme implementations and the current Unicode proposal for R6RS, as set forth in SRFI 75.
The R5RS procedures char->integer and integer->char are synonyms for char->scalar-value and scalar-value->char, respectively.
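As a small illustration of the conversions described above, the following sketch can be tried at the REPL. It assumes the unicode structure has been opened first (for example with ,open unicode). The helper in-scalar-value-range? is a hypothetical name introduced here only to spell out the scalar value range; it is not part of Scheme 48.

```scheme
;; Assumes the unicode structure is open, e.g. via ",open unicode".

(char->scalar-value #\A)                          ; => 65
(char->scalar-value (scalar-value->char 955))     ; => 955 (U+03BB)
(= (char->integer #\A) (char->scalar-value #\A))  ; => #t

;; Hypothetical helper (not part of Scheme 48): tests whether an exact
;; integer lies in [0, #xD7FF] or [#xE000, #x10FFFF].
(define (in-scalar-value-range? n)
  (and (integer? n)
       (exact? n)
       (or (<= 0 n #xD7FF)
           (<= #xE000 n #x10FFFF))))

(in-scalar-value-range? #xD800)    ; => #f (a surrogate)
(in-scalar-value-range? #x10FFFF)  ; => #t
```

Checking the range before calling scalar-value->char is one way to avoid the error it signals for integers that are not scalar values.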
6.2 Character and string literals
The syntax specified here is in line with the current Unicode proposal for R6RS, as set forth in SRFI 75, except for case-sensitivity. (Scheme 48 is case-insensitive.)
6.2.1 Character literals
The following character names are available in addition to what R5RS provides:
#\nul (ASCII 0)
#\alarm (ASCII 7)
#\backspace (ASCII 8)
#\tab (ASCII 9)
#\vtab (ASCII 11)
#\page (ASCII 12)
#\return (ASCII 13)
#\esc (ASCII 27)
#\rubout (ASCII 127)
#\x<x><x>... hex, explicitly or implicitly delimited, where <x><x>... denotes the scalar value of the character
6.2.2 String literals
The following escape characters in string literals are available in addition to what R5RS provides:
\a: alarm (ASCII 7)
\b: backspace (ASCII 8)
\t: tab (ASCII 9)
\n: linefeed (ASCII 10)
\v: vertical tab (ASCII 11)
\f: formfeed (ASCII 12)
\r: return (ASCII 13)
\e: escape (ASCII 27)
\': quote (ASCII 39, same as unquoted)
\<newline><intraline whitespace>: elided (allows a single-line string to span source lines)
\x<x><x>...;: hex, where <x><x>... denotes the scalar value of the character
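The literal syntax for characters and strings can be exercised with a few expressions. This is only a sketch based on the descriptions above; the expected values in the comments follow from the listed ASCII codes and from U+03BB having the scalar value 955.

```scheme
;; Character literal with a hex scalar value; the closing parenthesis
;; implicitly delimits the literal.
(char->integer #\x41)                      ; => 65
(char=? #\tab (integer->char 9))           ; => #t

;; In strings, the \x<x><x>...; escape is terminated by a semicolon.
(string-length "\x3BB;x")                  ; => 2
(char->integer (string-ref "\x3BB;x" 0))   ; => 955
(string=? "a\tb" (string #\a #\tab #\b))   ; => #t

;; A backslash followed by a newline and intraline whitespace is elided,
;; so this literal denotes the same string as "one two".
(string=? "one \
           two" "one two")                 ; => #t
```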
6.2.3 Identifiers and symbol literals
Where R5RS allows a <letter>, Scheme 48 allows in addition any character whose scalar value is greater than 127 and whose Unicode general category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co.
Moreover, when a backslash appears in a symbol, it must start a \x<x><x>...; escape, which identifies an arbitrary character to include in the symbol. Note that a backslash itself can be specified as \x5C;.
6.3 Character classification and case mappings
The R5RS character predicates -- char-whitespace?, char-lower-case?, char-upper-case?, char-numeric?, and char-alphabetic? -- all treat the full Unicode range.
Char-upcase and char-downcase as well as char-ci=?, char-ci<?, char-ci<=?, char-ci>?, char-ci>=?, string-ci=?, string-ci<?, string-ci>?, string-ci<=?, string-ci>=? all use the standard simple locale-insensitive Unicode case folding.
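The following expressions sketch how these R5RS procedures behave beyond ASCII. The characters are written with hex escapes so that the example does not depend on the encoding of the source file; the expected values assume the standard simple Unicode case mappings.

```scheme
;; U+00E4 and U+00C4 are the lower-case and upper-case forms of a-umlaut.
(char-ci=? #\xE4 #\xC4)                       ; => #t

;; U+03BB (Greek small lambda) upcases to U+039B.
(char->integer (char-upcase #\x3BB))          ; => 923, i.e. #x39B
(char-alphabetic? #\x3BB)                     ; => #t

;; Capital sigma (U+03A3) and small sigma (U+03C3) fold to the same character.
(string-ci=? "\x3A3;\x3B1;" "\x3C3;\x3B1;")   ; => #t
```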
In addition, Scheme 48 provides the unicode-char-maps structure for more complete access to the Unicode character classification with the following procedures and macros:
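(general-category general-category-name) -> general-category (syntax)
(general-category? x) -> boolean
(general-category-id general-category) -> string
(char-general-category char) -> general-category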
The syntax general-category returns a Unicode general category object associated with general-category-name. (See Figure 2 below.) General-category? is the predicate for general-category objects. General-category-id returns the Unicode category id as a string (also listed in Figure 2). Char-general-category returns the general category of a character.
Figure 2: Unicode general categories and primary categories
(general-category-primary-category general-category) -> primary-category
(primary-category primary-category-name) -> primary-category (syntax)
(primary-category? x) -> boolean
General-category-primary-category maps the general category to its associated primary category -- also listed in Figure 2. The primary-category syntax returns the primary-category object associated with primary-category-name. Primary-category? is the predicate for primary-category objects.
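A minimal sketch of how the classification procedures compose, assuming the unicode-char-maps structure is open (for example via ,open unicode-char-maps). The names char-category-id and char-primary-category are hypothetical helpers defined here for illustration; they are not exported by Scheme 48.

```scheme
;; Assumes ",open unicode-char-maps".

;; Hypothetical helper: the category id string of a character.
(define (char-category-id c)
  (general-category-id (char-general-category c)))

;; Hypothetical helper: the primary category object of a character.
(define (char-primary-category c)
  (general-category-primary-category (char-general-category c)))

(string? (char-category-id #\a))                  ; => #t
(general-category? (char-general-category #\a))   ; => #t
(primary-category? (char-primary-category #\7))   ; => #t
```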
The unicode-char-maps structure also provides the following additional case-mapping procedures for characters:
(char-titlecase? char) -> boolean
(char-titlecase char) -> char
(char-foldcase char) -> char
Char-titlecase? tests if a character is in titlecase. Char-titlecase returns the titlecase counterpart of a character. Char-foldcase folds the case of a character, i.e. maps it to uppercase first, then to lowercase.
The following case-mapping procedures on strings are available:
These implement the simple case mappings defined by the Unicode standard -- note that the length of the output string may be different from that of the input string.
6.4 SRFI 14
The SRFI 14 ("Character Sets") implementation in the srfi-14 structure is fully Unicode-compliant.
6.5 R6RS
The unicode-r6rs structure exports the procedures from the (r6rs unicode) library of the 5.91 draft of R6RS that are not already in the scheme structure:
string-normalize-nfd string-normalize-nfkd string-normalize-nfc string-normalize-nfkc
char-titlecase char-title-case? char-foldcase
string-upcase string-downcase string-foldcase string-titlecase
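As an illustration of the normalization procedures, the sketch below uses the precomposed character U+00E9 and its decomposition into "e" followed by U+0301 (combining acute accent). It assumes the unicode-r6rs structure is open and that the procedures implement the standard Unicode normalization forms.

```scheme
;; Assumes ",open unicode-r6rs".

;; NFD decomposes U+00E9 into two characters: "e" and U+0301.
(string-length (string-normalize-nfd "\xE9;"))        ; => 2

;; NFC recomposes the pair into the single character U+00E9.
(string-length (string-normalize-nfc "e\x301;"))      ; => 1
(string=? (string-normalize-nfc "e\x301;") "\xE9;")   ; => #t
```

Normalizing both operands to the same form before comparing them is the usual way to make string=? insensitive to such differences in representation.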
6.6 I/O
Ports must encode any text a program writes to an output port to a byte sequence, and conversely decode byte sequences when a program reads text from an input port. Therefore, each port has an associated text codec that describes how to encode and decode text.
Note that the interface to the text codec functionality is experimental and very likely to change in the future.
6.6.1 Text codecs
The i/o structure defines the following procedures:
These two procedures retrieve and set the text codec associated with a port, respectively. A program can set the text codec of a port at any time, even if it has already performed I/O on the port.
The text-codecs structure defines the following procedures and macros:
Text-codec? is the predicate for text codecs. Null-text-codec is primarily meant for null ports that never yield input and swallow all output. The following text codecs implement the US-ASCII, Latin-1, Unicode UTF-8, Unicode UTF-16 (little-endian), Unicode UTF-16 (big-endian), Unicode UTF-32 (little-endian), Unicode UTF-32 (big-endian) encodings, respectively.
Find-text-codec finds the codec associated with an encoding name. The names of the above encodings are "null", "US-ASCII", "ISO8859-1", "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", and "UTF-32BE", respectively.
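A minimal sketch of looking up a codec by name, assuming the text-codecs structure is open. The helper codec-or-default and the variable utf-8 are hypothetical names used only for this example. The sketch also assumes that find-text-codec returns a value that is not a text codec (rather than signalling an error) when the name is unknown.

```scheme
;; Assumes ",open text-codecs".

;; Hypothetical helper: look up a codec by encoding name and fall back
;; to a default if the lookup does not yield a text codec.
(define (codec-or-default name default)
  (let ((codec (find-text-codec name)))
    (if (text-codec? codec)
        codec
        default)))

(define utf-8 (find-text-codec "UTF-8"))

(codec-or-default "UTF-16LE" utf-8)            ; the UTF-16 (little-endian) codec
(codec-or-default "X-NO-SUCH-ENCODING" utf-8)  ; falls back to utf-8
```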
6.6.2 Text-codec utilities
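The text-codec-utils structure exports a few utilities for dealing with text codecs:
(guess-port-text-codec-according-to-bom port) -> text-codec or #f
(set-port-text-codec-according-to-bom! port) -> boolean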
These procedures look at the byte-order mark (also called the "BOM", U+FEFF) at the beginning of a port and guess the appropriate text codec. This works only for UTF-16 (little-endian and big-endian) and UTF-8.
Guess-port-text-codec-according-to-bom returns the text codec, or #f if it found no UTF-16 or UTF-8 BOM. Note that this actually reads from the port. If the guess does not succeed, it is probably a good idea to re-open the port.
Set-port-text-codec-according-to-bom! calls guess-port-text-codec-according-to-bom and, if successful, sets the port's text codec to the result and returns #t. If it is not successful, it returns #f. As with guess-port-text-codec-according-to-bom, this reads from the port, whether successful or not.
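The advice about re-opening the port can be sketched as follows, assuming the text-codec-utils structure is open. The procedure name open-input-file/bom is hypothetical. When no BOM is found, the file is simply opened again, so that the bytes consumed by the guess are not lost and the new port keeps its default text codec.

```scheme
;; Assumes ",open text-codec-utils".

;; Hypothetical helper: open FILENAME for input and select the text
;; codec from a leading BOM if there is one.
(define (open-input-file/bom filename)
  (let ((port (open-input-file filename)))
    (if (set-port-text-codec-according-to-bom! port)
        port
        (begin
          ;; The guess read from the port, so start over with a fresh one.
          (close-input-port port)
          (open-input-file filename)))))
```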
6.6.3 Creating text codecs
(make-text-codec strings encode-proc decode-proc) -> text-codec
(define-text-codec id name encode-proc decode-proc) (syntax)
(define-text-codec id (name ...) encode-proc decode-proc) (syntax)
Make-text-codec constructs a text codec from a list of names, an encode procedure, and a decode procedure. (See below on how to construct encode and decode procedures.) Text-codec-names, text-codec-encode-char-proc, and text-codec-decode-char-proc are the accessors for text codecs. Define-text-codec is shorthand for binding a global identifier to a text codec. Its first form is for codecs with only one name, the second for codecs with several names.
Encoding and decoding procedures work as follows:
(encode-proc char buffer start count) -> boolean maybe-count
(decode-proc buffer start count) -> maybe-char count
An encode-proc consumes a character char to encode, a byte vector buffer to receive the encoding, an index start into the buffer, and a block size count. It is supposed to write the encoding of char into the block at [start, start + count). If the encoding is successful, the procedure must return #t and the number of bytes needed by the encoding. If the character cannot be encoded at all, the procedure must return #f and #f. If the encoding is possible but the space is not sufficient, the procedure must return #f and the total number of bytes needed for the encoding.
A decode-proc consumes a byte vector buffer, an index start into the buffer, and a block size count. It is supposed to decode the bytes at indices [start, start + count). If the decoding is successful, it must return the decoded character at the beginning of the block, and the number of bytes consumed. If the block cannot begin with or be a prefix of a valid encoding, the procedure must return #f and #f. If the block contains a true prefix of a valid encoding, the procedure must return #f and the total count of bytes (including those already in the buffer) needed to complete the encoding. Note that this byte count is only a guess: the system will provide that many bytes, but the decoding procedure might still signal an incomplete encoding, causing the system to try to obtain more.
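The return conventions above can be illustrated with a sketch of a fixed-width, two-byte big-endian codec for scalar values up to #xFFFF. This is an illustration, not one of the codecs shipped with Scheme 48: the names ucs-2be-sketch-encode, ucs-2be-sketch-decode, ucs-2be-sketch-codec and the encoding name "X-UCS-2BE-SKETCH" are hypothetical. The sketch also assumes that byte-vector-ref and byte-vector-set! are available from the byte-vectors structure.

```scheme
;; Assumes ",open text-codecs byte-vectors"; the byte-vector accessors
;; used below are an assumption about the byte-vectors structure.

(define (ucs-2be-sketch-encode char buffer start count)
  (let ((sv (char->integer char)))
    (cond ((> sv #xFFFF)
           (values #f #f))             ; cannot be encoded at all
          ((< count 2)
           (values #f 2))              ; encodable, but 2 bytes are needed
          (else
           (byte-vector-set! buffer start (quotient sv 256))
           (byte-vector-set! buffer (+ start 1) (remainder sv 256))
           (values #t 2)))))           ; success: 2 bytes written

(define (ucs-2be-sketch-decode buffer start count)
  (if (< count 2)
      (values #f 2)                    ; true prefix: 2 bytes needed in total
      (let ((sv (+ (* 256 (byte-vector-ref buffer start))
                   (byte-vector-ref buffer (+ start 1)))))
        (if (<= #xD800 sv #xDFFF)
            (values #f #f)             ; a surrogate is not a valid encoding here
            (values (integer->char sv) 2)))))

(define ucs-2be-sketch-codec
  (make-text-codec '("X-UCS-2BE-SKETCH")
                   ucs-2be-sketch-encode
                   ucs-2be-sketch-decode))
```

The decoder checks for surrogates before calling integer->char, since that procedure signals an error for integers that are not scalar values.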
6.7 Default encodings
The default encoding for new ports is UTF-8. For the default current-input-port, current-output-port, and current-error-port, Scheme 48 consults the OS for encoding information.
For Unix, it consults nl_langinfo(3), which in turn consults the LC_ environment variables. If the encoding is not defined that way, Scheme 48 reverts to US-ASCII.
Under Windows, Scheme 48 uses Unicode I/O (using UTF-16) for the default ports connected to the console, and Latin-1 for default ports that are not.