Tcl 9 has significantly expanded support for Unicode characters compared to Tcl 8. The changes are documented across multiple TIPs and manpages, which makes it difficult for some people (meaning me) to get their head around the functionality. This document will eventually consolidate Tcl 9 Unicode related support in one place.
However, at the moment its purpose is to document how the current implementation works and highlight what many of us consider major deficiencies in the API. Although the debate so far has centered around channel I/O and strictness related options, there are other issues, some of which might just be bugs while others are more substantial.
The hope is this will allow folks not following the debates on the mailing list and elsewhere to catch up and express their views.
Note: this area is still under debate and development on multiple branches. What follows is the behavior of the commit fb44cd608e43667eef4beeefa87b81a775470666 on the main branch.
The following definitions from Chapter 3 (PDF) of the Unicode standard are relevant for the discussion below.
D9 Unicode codespace: A range of integers from 0 to 0x10FFFF.
D10 Code point: Any value in the Unicode codespace.
D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
D12 Coded character sequence: An ordered sequence of one or more code points. Note here the word coded does not refer to encoding transforms like UTF-8 but rather to the mapping of abstract characters to integer code points.
D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
D78 Code unit sequence: An ordered sequence of one or more code units.
D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
Conformant error handling of invalid encoded byte sequences is defined in 3.2 C10, 3.2 D93 and 5.22. These should either raise an error or replace the sequence with the appropriate (defined there) number of U+FFFD REPLACEMENT CHARACTER code points. In addition, as a special case for handling file names and similar which require lossless conversion, Unicode TR #36 3.7 recommends several alternatives including PEP 383.
In Tcl 9, a string at the script level seems to correspond to a sequence of Unicode code points as defined in the standard. In particular,
It is not a sequence of abstract characters as defined in the standard as it allows for code points that must not be interpreted as characters.
It is not a sequence of glyphs or graphemes in the manner a human reader might recognize as characters.
It is not a sequence of Unicode scalar values as the latter does not include values in the high/low surrogate range.
It is not a Unicode string as defined in the standard as that is defined in terms of encoding forms which in turn are defined in terms of Unicode scalar values.
However, while at the script level strings can only be constructed as a sequence of Unicode code points in the range U+0:U+10FFFF, at the C level Tcl_NewUnicodeObj etc. allow a Tcl string to contain values outside that range. It is not clear if this is legal or whether no check is made because of the performance cost. This leads to inconsistent internal Tcl_Obj structures.
There are multiple ways to include non-ASCII code points in string literals in ASCII program text. With X denoting a hexadecimal digit,

\xXX for code points in the range 0-0xFF.
\uXXXX for code points in the range 0-0xFFFF.
\UXXXXXXXX for code points in the range 0-0x10FFFF.

Note that as the documentation of the \U form states, up to eight hex digits may be specified but the parsing of digits stops if the resulting value would exceed 0x10FFFF.
So for example,
% scan \U10FFFFFF %c%c%c
1114111 70 70
It is not possible to generate code points above 0x10FFFF using this notation. That is assumed intentional as values in that range are not actually valid code points.
Binary strings in Tcl 9 are simply coded character sequences (Tcl strings) where each code point is in the range 0-255. The Tcl binary command operates on such strings as though they were binary data (a sequence of bytes).

Note there is a change from Tcl 8 in the behavior of the binary command when its operand contains code points above 255. In Tcl 8, the higher order bits would be ignored.
% package require Tcl
8.6.13
% binary encode hex \UFF
ff
% binary encode hex \U100
00
In Tcl 9, an error is raised.
% package require Tcl
9.0a4
% binary encode hex \UFF
ff
% binary encode hex \U100
expected byte sequence but character 0 was 'Ā' (U+000100)
I have not found any manpage or TIP that documents what constitutes a Tcl string. As discussed earlier, it appears from behavior that a Tcl string is a sequence of code points but that should be explicitly documented else it leads to confusion as to whether surrogate code points may be present in the string, whether out of range code points are allowed etc. This leads to the issue described in the following section.
On a related but lesser note, the use of the term Unicode string in the Tcl manpages is not even remotely similar to the definition of Unicode string in the Unicode standard (see earlier).
My preference would be to define strings as sequences of Unicode code points and change all mentions of Unicode strings to just simply strings.
Along the same lines, the Tcl man pages use the term character when they should really use code point. I’m ambivalent as to whether this should be changed to be more accurate, as character can be interpreted in multiple ways, none of which match what Tcl considers a character. On the other hand, for most readers the term character is more natural.
Irrespective of whether values outside U+0:U+10FFFF are legal or not, it is important that a Tcl_Obj object be consistent in its internal structure. Currently this is not the case. Values outside that range (for example 0x7FFFFFFF) result in a Tcl_Obj whose bytes field contains the UTF-8 sequence \xef\xbf\xbd (corresponding to U+FFFD) while the structure’s internal representation String.unicode contains the original value (0x7FFFFFFF). This inconsistency implies the result of operations on that object would depend on whether the implementation used the byte representation or the internal representation. This is not a good thing. The following illustrates the potential for anomalous behavior (teststringobj newunicode maps to Tcl_NewUnicodeObj for testing purposes):
% set u \uFFFD
% set c [teststringobj newunicode 1 0x7fffff7f]
% regexp $u $c
0
% string equal $u $c
1
The above is because regexp works with the String.unicode internal representation while string equal uses the bytes field.
Furthermore, Tcl internally combines flags with the code point values. For example, the value 0x1000001 passed in via Tcl_NewUnicodeObj will be interpreted as the flag TCL_COMBINE (defined as 0x1000000) or-ed with the code point U+0001 when passed to internal Tcl encoding functions. This misinterpretation, confusing data bits passed in as internal flags, also makes me very uncomfortable though I’m not sure what the ramifications are.
Assuming integer values above 0x10FFFF are in fact illegal, fixing this by adding checks to the C API and returning errors does not seem feasible:
There would be a performance hit for large strings.
Even if the performance hit was tolerable, semantics of some of the C APIs do not allow for failures and changing that would seriously break compatibility.
Alternatives are to:

Change Tcl string semantics to allow for code points above U+10FFFF. Tcl’s internal UTF-8 encoding would need to change to allow for 6 byte UTF-8 sequences. Note this does not mean Tcl’s external encoders would treat these code points as legal.

Alternatively, replace the out of range code points with U+FFFD REPLACEMENT CHARACTER and document the C API accordingly. This is in effect what the current implementation does, but only partially as described above. The implementation would have to change to fix up the String representation as well. This could be done either at the time the C APIs are called, or if that is considered too detrimental to performance, lazily when the values are encountered in string operations that use the String.unicode representation.
The Tcl manpage states
The range U+00D800–U+00DFFF is reserved for surrogates, which are illegal on their own. Therefore, such sequences will result in the replacement character U+FFFD.
This is not how the implementation behaves.
% format %x [scan \U00D800 %c]
d800
My assumption is the manpage is wrong and needs to be corrected as Tcl strings permit inclusion of values in the surrogate range.
The \U and \u escape sequences have variable lengths. This is probably by design as in Tcl 8 and not likely to be changed. I am listing it here because I find it confusing from a readability perspective. As an example, guess the result of

string length \UABCDEF
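Based on the parsing rule quoted earlier, the escape should consume only \UABCDE (a valid code point) and leave a trailing literal F, so the expected result is:

% string length \UABCDEF
2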
As a general rule, Tcl’s string related commands (string, regexp) are “Unicode-aware”. However, they operate on sequences of zero or more code points, i.e. coded character sequences as defined above and in the Unicode standard. Tcl does not operate on abstract characters as defined in the Unicode standard, or on glyphs or graphemes, which are what humans would recognize as characters.
Two coded character sequences that represent the same abstract character will not be treated as equal. For example, the abstract character e with acute may be represented by the precomposed code point U+00E9 (Latin Small Letter E with Acute) or as the decomposed form U+0065, U+0301 (Latin Small Letter E followed by Combining Acute Accent). These are treated as different strings in Tcl even though they represent the same grapheme and would be considered the same by a human, assuming a competent display driver.
This behavior is reflected through all Tcl commands and leads to what might seem to be anomalous behavior. The string length command would return 2 for the length of \u0065\u0301 though the display would only show a single character. Similarly, string index would return U+0065 (e) while the user may expect to see é based on what is seen in the wish console.
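To illustrate (expected results given the code point semantics described above):

% string length \u00e9
1
% string length \u0065\u0301
2
% string equal \u00e9 \u0065\u0301
0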
String comparisons and sorting are done by comparing the numeric values of the code points in each coded character sequence, not by using any locale information. Thus when sorting strings, the letter f would sort before the precomposed character U+00E9 (\u00e9) but after the equivalent U+0065, U+0301 sequence. This can lead to visually surprising results, for example, when displaying a sorted list of file names.
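To illustrate, sorting the three strings and printing the first code point of each element of the result is expected to show the decomposed form sorting before f and the precomposed form after it:

% lmap s [lsort [list f \u00e9 e\u0301]] {format %04X [scan [string index $s 0] %c]}
0065 0066 00E9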
TODO Does the lsort -dictionary option understand character case and digits outside of ASCII?
Tcl’s string is command has several character class checks, such as space and digit. These are Unicode aware. For example, for U+0967, the digit 1 in the Devanagari script,
% string is digit \u0967
1
There is one new classification command in Tcl 9, string is unicode. Experimentally, it appears that the term refers to any code point other than those in the surrogate and noncharacter categories as defined in the Unicode standard.
% string is unicode \uD800; # Surrogates
0
% string is unicode \uFDD0; # Noncharacter code point
0
% string is unicode \uE000; # Private use
1
% string is unicode \UE4000; # Unassigned/reserved
1
% string is unicode \u001F; # Control
1
% string is unicode \u0020; # Graphic
1
% string is unicode \u200E; # Format
1
Note that string is unicode cannot be used to check for abstract characters as it returns 1 for unassigned code points, which are not to be treated as abstract characters per the standard.
string is unicode
There are a couple of issues with the new string is unicode command, apart from the fact that it is not currently documented.
The name of the command is ambiguous and therefore confusing. Does unicode refer to Unicode code points, Unicode scalar values, Unicode strings, or what? Experimentally it appears, as shown above, that it refers to a subset of Unicode character categories. This is not at all clear from the name.
Perhaps the command string is unicode should be renamed to string is abstractchar or string is char.
Further, it is not very clear where this command is useful. It has been suggested that it can be used to check whether a Tcl string value can be conformantly transformed via a UTF encoding for transmission. However, this use does not hold because transmission of noncharacter code points is explicitly permitted by the standard.
TIP 652 discusses this in further detail and suggests changing the command to correspond to the categories as defined in the standard. There has been no discussion on the TIP as yet.
The Unicode standard explicitly warns against interpretation of code points in the Surrogate and Noncharacter categories as characters when working with sequences of characters. The string commands operate on code points and violate this.
Given that Tcl strings have been implicitly defined as sequences of code points and not characters, it is not clear much can be done about this other than documenting that Tcl strings are not strings of characters as defined in the Unicode standard.
When exchanging data between processes, via file, network etc., Tcl string values have to be transformed to and from a sequence of bytes using some encoding. This transform may be done either explicitly with the encoding command or as part of I/O by configuring a channel with the -encoding option. This section only discusses the former. I/O is discussed in a later section.
Relevant definitions from the Unicode standard are
D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
D78 Code unit sequence: An ordered sequence of one or more code units.
D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. Note the reference to scalar value which means surrogate code points should never be transformed to any of the UTF encoding forms. (They can appear in a UTF-16 stream as a result of transforming a code point outside the BMP).
The standard defines the Unicode encoding forms UTF-8, UTF-16 and UTF-32. Note however that Tcl supports many “traditional” encodings like cp1252 as well, and any discussion needs to include these.
The major discussion point with respect to encoding transforms is dealing with error cases:
defining what constitutes an error. This may depend on the command options in effect.
how errors are reported or handled.
The type of errors encountered depends on the operation - encoding versus decoding.
NOTE: Most of the discussion below is related to UTF encoders unless stated otherwise.
The following errors may be encountered when decoding an encoded byte stream:
Case 1. The byte value may be one that should never appear in the specified encoding or at a particular position in a multibyte encoding. For example, the values \xC0 and \xC1 should never appear at any point in a UTF-8 encoded byte sequence. As an example of the latter, the byte \x80 (amongst others) should never appear as the lead byte of a multibyte sequence in ShiftJIS.
Case 2. The rules for the encoding do not permit the value to have been encoded in the first place. For example, surrogate Unicode code points should never be encoded and thus should be treated as an error when encountered during a decoding operation. (Note the surrogate could appear in the UTF-16 encoded byte sequence. But the decoded value should never be a surrogate code point.)
Case 3. A byte subsequence within a byte sequence that is encoded with a multibyte encoding terminates prematurely. This may or may not be an error depending on whether the subsequence is in the middle of the containing byte sequence or at the end. In the latter case, it may just mean more bytes are needed, as may happen when data is read over a streaming interface. For example, the UTF-8 sequence \xC2\x41 is a hard error as there is no trailing byte succeeding the lead byte \xC2 (\x41 cannot be a trailing byte). On the other hand, the sequence \x41\xC2 may not be an error because additional data may arrive containing a valid trailing byte to complete the \xC2.
Case 4. The decoded values may lie outside the range of Unicode code points. For example, the UTF-32 encoded sequence \x7F\xFF\xFF\x7F trivially translates to the integer value U+7FFFFF7F which is greater than the largest valid code point U+10FFFF. This is distinguished from Case 2 because it is treated differently by Tcl.
How the various error cases above are handled by the encoding convertfrom command depends on several options.
By default, when the encoding is one of the UTF encoders, the command will not raise an error for any case except Case 4 above.
If an invalid byte or byte sequence is detected, it is simply transformed to the Unicode code point with the same integer value. In the examples below, \xC0 and \xC1 are invalid at any position in a UTF-8 byte sequence while \x80 is valid only as a trailing byte preceded by a valid lead byte.
Surrogates are happily accepted, though again they should not appear in encoded form in UTF-8 streams.
Incomplete encoded sequences are treated similarly to the first case. A byte containing \xC2 is a lead byte and should have a valid trailing byte following it.
The only time an error is raised with the default behavior is when the decoded value lies outside the range of Unicode code points.
Note the above constitutes non-conformant behavior.
Examples of the above error cases:
proc codepoints {s} {
    join [lmap c [split $s ""] {
        string cat U+ [format %.6X [scan $c %c]]
    }]
}
% codepoints [encoding convertfrom utf-8 \xC1\x80]; # Case 1
U+0000C1 U+000080
% codepoints [encoding convertfrom utf-8 \xC0\x81]; # Case 1
U+0000C0 U+000081
% codepoints [encoding convertfrom utf-8 \xC0\x80]; # Case 1 - Special handling!
U+000000
% codepoints [encoding convertfrom utf-8 \xed\xa0\x80]; # Case 2 - surrogate
U+00D800
% codepoints [encoding convertfrom utf-8 \xC2\x81]; # Case 3 - \xC2 followed by valid trail byte (valid UTF-8)
U+000081
% codepoints [encoding convertfrom utf-8 \xC2\x41]; # Case 3 - \xC2 followed by invalid trail byte
U+0000C2 U+000041
% codepoints [encoding convertfrom utf-8 \xC2]; # Case 3 - \xC2 with nothing following
U+0000C2
% encoding convertfrom utf-32 \x7f\xff\xff\x7f; # Case 4 raises an error
unexpected byte sequence starting at index 0: '\x7F'
Note the special treatment specifically for the pair \xC0\x80.
This default behavior not only violates the Unicode standard but also results in unexpected data modification.
-strict option
Strict conformance to the Unicode standard can be enforced with the -strict option. In this case all the error cases raise exceptions.
% encoding convertfrom -strict utf-8 \xC0\x81; # Case 1
unexpected byte sequence starting at index 0: '\xC0'
% encoding convertfrom -strict utf-8 \xC0\x81; # No special handling for \xC0\x80
unexpected byte sequence starting at index 0: '\xC0'
% encoding convertfrom -strict utf-8 \xed\xa0\x80; # Case 2 - surrogate
unexpected byte sequence starting at index 0: '\xED'
% encoding convertfrom -strict utf-8 \xC2\x41; # Case 3 - invalid trail byte
unexpected byte sequence starting at index 0: '\xC2'
% encoding convertfrom -strict utf-8 \xC2; # Case 3 - premature termination
unexpected byte sequence starting at index 0: '\xC2'
% encoding convertfrom -strict utf-32 \x7f\xff\xff\x7f; # Case 4 raises an error as before
unexpected byte sequence starting at index 0: '\x7F'
As far as I know, use of -strict makes the encoding convertfrom command conformant with the Unicode standard.
-nocomplain option
The final option that affects whether invalidly encoded bytes are treated as errors is -nocomplain. This is the opposite of -strict in the sense that it never raises an exception. In particular, it differs from the default behavior in its handling of Case 4 above. When the decoded value is outside the Unicode code point range, instead of raising an error, it replaces the value with U+FFFD REPLACEMENT CHARACTER as defined in the Unicode standard.
% codepoints [encoding convertfrom -nocomplain utf-32 \x7f\xff\xff\x7f]
U+00FFFD
-failindex option
The -failindex option differs from the -strict and -nocomplain options in that it does not change what is considered an error but rather changes the error reporting mechanism. When this option is specified, on detecting an error encoding convertfrom will return the string successfully decoded up to the failing offset. Additionally, the failing offset will be returned in the variable passed as the -failindex option value.
% encoding convertfrom utf-32le \x41\x00\x00\x00\x7f\xff\xff\x7f
unexpected byte sequence starting at index 4: '\x7F'
% encoding convertfrom -failindex fidx utf-32le \x41\x00\x00\x00\x7f\xff\xff\x7f
A
% set fidx
4
The use case for this option is incremental decoding of binary data that is fragmented (for example coming over a socket). A toy example would be
% foreach part [list \x41\x00\x00\x00\x42 \x00\x00\x00] {
    append remaining $part
    puts [encoding convertfrom -failindex fidx utf-32le $remaining]
    if {$fidx == -1} {
        set remaining ""
    } else {
        set remaining [string range $remaining $fidx end]
    }
}
A
B
Use of this option as above has a significant issue in that there is no distinction between a failure due to an invalid byte value (hard error) versus a failure because the complete binary stream is not available (more data needed).
Going in the other direction, converting Tcl strings to an encoded byte sequence also has potential for errors.
Case 1. The encoding does not support the Unicode code point. For example, code points higher than U+00FF are not supported in the ASCII encoding.
Case 2. The encoding may be able to encode a Unicode code point but the rules for the encoding do not allow it. For example, the Unicode standard for UTF-8 encoding prohibits encoding of surrogate code points. So although the surrogate U+DC00 can be encoded as the byte sequence \xED\xB0\x80, it is prohibited by the standard.
Case 3: The value of the code point lies outside the valid code point range.
Tcl strings are transformed to binary strings (byte sequences) with the encoding convertto command.

By default the command will raise an error for Cases 1 and 2 above.
% encoding convertto ascii A\u00e9B; # Case 1
unexpected character at index 1: 'U+0000E9'
% encoding convertto ascii \uDC00; # Case 2
unexpected character at index 0: 'U+00DC00'
% encoding convertto utf-8 \uDC00; # Case 2
unexpected character at index 0: 'U+00DC00'
However, no error is raised for Case 3 for UTF encodings (only!). U+FFFD is substituted instead.
% binary encode hex [encoding convertto ascii A[teststringobj newunicode 1 0x7fffff7f]B]; # Case 3
unexpected character at index 1: 'U+00FFFD'
% binary encode hex [encoding convertto utf-8 A[teststringobj newunicode 1 0x7fffff7f]B]; # Case 3
41efbfbd42
It is not clear whether these discrepancies between Cases 1/2 and Case 3, as well as between ASCII and UTF-8 for Case 3, are intentional.
It is not clear why the encoding convertto and encoding convertfrom commands differ in that the former raises an error by default and the latter does not.
-strict option
Although the encoding convertto command has the -strict option like encoding convertfrom, it is not clear what effect it has. It seems to make no difference to the default behavior.
% encoding convertto -strict ascii A\u00e9B; # Case 1
unexpected character at index 1: 'U+0000E9'
% encoding convertto -strict ascii \uDC00; # Case 2
unexpected character at index 0: 'U+00DC00'
% encoding convertto -strict utf-8 \uDC00; # Case 2
unexpected character at index 0: 'U+00DC00'
-nocomplain option
The -nocomplain option turns off all raising of exceptions. For ASCII, and probably all non-UTF encodings, this results in the code point being replaced by an encoding-specific character, generally the question mark.
% encoding convertto -nocomplain ascii A\xe9B
A?B
% encoding convertto -nocomplain ascii A\uDC00B
A?B
% encoding convertto -nocomplain ascii A[teststringobj newunicode 1 0x7fffff7f]B
A?B
% encoding convertto -nocomplain shiftjis A[teststringobj newunicode 1 0x7fffff7f]B
A?B
For UTF encodings, behavior is different. For Case 2 (Case 1 is not possible) the surrogate is output in encoded form. For Case 3, the Unicode U+FFFD REPLACEMENT CHARACTER is used to replace the invalid code point as for the other encodings.
% binary encode hex [encoding convertto -nocomplain utf-8 A\uDC00B]; # Case 2
41edb08042
% binary encode hex [encoding convertto -nocomplain utf-32 A\uDC00B]; # Case 2
4100000000dc000042000000
% binary encode hex [encoding convertto -nocomplain utf-8 A[teststringobj newunicode 1 0x7fffff7f]B]; # Case 3
41efbfbd42
% binary encode hex [encoding convertto -nocomplain utf-32 A[teststringobj newunicode 1 0x7fffff7f]B]; # Case 3
41000000fdff000042000000
The resulting UTF byte sequence will be conformant for Case 3, but not Case 2.
-failindex option
With the -failindex option specified, encoding convertto encodes as much as it can and returns the index of the character that could not be encoded.
% list [encoding convertto -failindex fidx ascii A\xe9B] $fidx
A 1
% list [encoding convertto -failindex fidx utf-8 A\uDC00B] $fidx
A 1
The Unicode standard specifies conforming behaviors for handling invalid byte sequences. Among these the only one supported by Tcl 9 currently is raising of an error exception.
The alternative of replacing invalid bytes with the U+FFFD REPLACEMENT CHARACTER as defined in the standard is not available. This is common and useful functionality expected by many internationalized applications and available in practically all other languages.
The encoding convertfrom command should support the above. The encoding convertto command may also need it to deal with the case where an attempt is made to encode code points (surrogates, for example) that are invalid in the target UTF encoding.
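The missing replace behavior can, in principle, be emulated in script on top of the current -failindex reporting. The sketch below is purely illustrative: the decode_with_replace helper is hypothetical, it assumes -strict and -failindex can be combined, and it substitutes one U+FFFD per failing byte rather than following the standard’s exact substitution rules.

# Hypothetical helper (not part of Tcl): emulate a "replace" policy using
# -failindex. Each failure position is replaced with U+FFFD and decoding
# resumes one byte later.
proc decode_with_replace {enc bytes} {
    set out ""
    while {[string length $bytes] > 0} {
        append out [encoding convertfrom -strict -failindex idx $enc $bytes]
        if {$idx < 0} {
            break
        }
        append out \uFFFD
        set bytes [string range $bytes [expr {$idx + 1}] end]
    }
    return $out
}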
Earlier, several different cases were listed for invalid byte sequences encountered by the encoding conversion commands. Handling these has two aspects:
Specifying which cases are treated as errors and which are handled with a fallback strategy such as replacing with some fixed or mapped character.
Once a case is to be treated as an error, how it is to be reported to the caller.
The first is controlled through the -strict and -nocomplain options while the latter is handled by -failindex.
The use of two separate options, -strict and -nocomplain, leads to some confusion in terms of semantics when the options are used in combination.
It is also the case that for encoding convertto, the -strict option does not seem to have any effect as the default behavior (in contrast to convertfrom) already has strict semantics.
More important is that adding new handlers, such as character replacement as specified in the Unicode standard or lossless handling, means adding more options. Although it may be possible to add options such as -replace and -lossless, it would be a cleaner interface to have a single option whose value determines the behavior.
TIP 654 proposes profiles which serve this purpose. Specifying the error handling would then take the form
encoding convertfrom -profile strict utf-8 $s
encoding convertfrom -profile lossless utf-8 $s
encoding convertfrom -profile replace utf-8 $s
and so on. TIP 654 still needs to be fleshed out but its model seems to me a better interface than continuing to add mutually exclusive options to control classification of error cases.
A related issue with these options is that they are treated inconsistently depending on the conversion direction and the encoding in use. This is detailed in other sections.
By default, the UTF decoders map invalid bytes to their numerically equal code points. Effectively they assume the encoding is iso8859-1 (or cp1252?).
This is just plain wrong and saying it works that way in Tcl 8 does not make it any less so. It also happens to be non-conformant.
As an example of undesirable consequence, consider the treatment of ZIP archives as discussed on the mailing list. ZIP archives do not contain metadata indicating the encoding used for file paths stored in the archive. Handling of these paths is important in at least two contexts:
They should be displayed correctly to the user or at least an indication that some characters could not be decoded
If the file name is written out (say only the content is changed), the original name should be preserved.
Consider the single byte file name \xE0 created on a CP1250 system, where it corresponds to the character LATIN SMALL LETTER R WITH ACUTE (U+0155). When Tcl reads it with a (guessed) UTF-8 encoding in default mode, a standalone \xE0 is invalid so it is passed through as its numerically equivalent code point (U+00E0), resulting in the file name within Tcl being LATIN SMALL LETTER A WITH GRAVE. This is how it will be displayed to the user and how the file will be renamed when written out.
This silent modification of data that is invisible to the application and user is unacceptable in my mind.
For a discussion on this issue, see the mailing list post.
A related issue is that this behavior is not even consistent between encodings as described separately later.
There are specific circumstances where the encoding / decoding are required to support lossless operation even in the presence of invalid bytes. The requirement can be stated as
encode(decode(x)) == x
Neither Tcl’s non-conformant default replacement with numeric equivalent nor the Unicode standard’s replace with U+FFFD behaviors meet this requirement.
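To illustrate the round-trip failure with the current default behavior, the expected result (given the decoder’s treatment of dangling lead bytes shown earlier) is that the bytes \x41\xE0 do not come back unchanged:

% binary encode hex [encoding convertto utf-8 [encoding convertfrom utf-8 \x41\xE0]]
41c3a0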
Situations where this is important include dealing with file names (which may be any sequence of bytes on Unix) and system interfaces. This is elaborated in Unicode TR #36 3.7 which specifies several acceptable alternatives for achieving this one of which is Python’s PEP 383.
Quoting: “With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF.”
See the PEP and Option D in Substituting malformed UTF-8 sequences in a decoder for why this is viable.
The Unicode Tech report #36 also semi-blesses this, again with the same caveat that use should be restricted. It also suggests other alternatives for the same purpose.
Note there are some associated caveats as to where this handling is appropriate and would need to be followed.
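As a purely illustrative sketch (these helper procedures are hypothetical and not part of Tcl or the PEP), the byte-to-surrogate mapping and its inverse are simply:

# Hypothetical illustration of the PEP 383 idea: an undecodable byte b in
# the range 0x80-0xFF decodes to the lone surrogate U+DC00+b, and such
# surrogates map back to the original byte on encoding, so the round trip
# encode(decode(x)) == x holds.
proc pep383_escape_byte {b} { expr {0xDC00 + $b} }
proc pep383_unescape_codepoint {cp} { expr {$cp - 0xDC00} }

% format %X [pep383_escape_byte 0xE0]
DCE0
% format %X [pep383_unescape_codepoint 0xDCE0]
E0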
Fixing this issue requires an option to encoding that specifies this handling of invalid encoded input. Python uses -surrogateescape but if a different mechanism is chosen, something like -lossless might be appropriate. If the suggestions related to reworking the -strict and -nocomplain options are accepted, a lossless profile could be implemented.
The default behavior of the UTF encoders is to not force strict conformance. Instead they effectively assume the encoding of the invalid bytes to be ISO8859-1.
This is wrong. Tcl should not be in the business of “guessing” what the encoding was supposed to be unless TIP 131 is implemented.
Loose, “forgiving” behavior leads to latent bugs, security holes etc. and should be explicitly requested by the application, not enabled by default.
The common push back is that this allows better Tcl 8 compatibility and changing it would break applications. That breakage is a good thing as it forces those applications to be modified to be correct and robust. Moreover, be it noted that the binary encoding already breaks Tcl 8 compatibility (intentionally and with good reason).
Simply put, strict conformance should be the default.
Note in passing that the manpage does not reflect the current behavior while the TIPs are conflicting. TIP 601 specifies strictness as the default (as seems to be the case for ascii, for example) while TIP 346 implies it is not (else why have the -strict option?).
-failindex does not distinguish errors from incomplete sequences

The -failindex option to encoding convertfrom is intended to allow incremental decoding of a byte sequence as described earlier. However, it does not provide a mechanism to distinguish a hard error (invalid bytes) from soft errors (more data needed for incomplete sequences). This makes it insufficient for use.
Consider trying to decode A\xC2 versus A\xC0. Both will return “A” with a failindex of 1. The former may be decoded successfully through the arrival of more data that is appended. Adding more data to the latter is never going to help; it will continue to fail. But there is no simple way to determine which case applies.
One way to fix this would be to support another option, -failcode, whose value would be a variable that holds the reason for the failure when the -failindex option variable indicates failure. This would allow distinguishing hard errors from soft errors as in the following pseudocode.
# Pseudocode: -failcode is a hypothetical option, not present in Tcl
while {...more data available...} {
    append bindata [get more data]
    set decoded [encoding convertfrom -failindex fidx -failcode fcode \
                     $encoding $bindata]
    if {$fidx != -1} {
        if {$fcode ne "NEEDMOREDATA"} {
            # Hard error
            error "Incoming data is not encoded correctly"
        }
        # Not really an error, just need more data
        set bindata [string range $bindata $fidx end]
    } else {
        set bindata ""
    }
    # Do something with the decoded data
    puts $decoded
}
Alternatively, instead of another option, the -failindex semantics could be modified to return a pair containing the failing index and a failure code.
This is further discussed in the mailing list thread. One of the posts there suggests an alternative by looping while continuously trying to add data. Aside from the performance cost, in my opinion this is not straightforward for a programmer to have to write.
The default behavior of the encoding conversion commands and their options is not consistent amongst different encodings. It is reasonable that encodings differ in what they consider as “strict”. However,
The programmer should not be surprised by what an encoding chooses to define as strict or loose.
The notion of strictness for an encoding should be uniformly applied.
The defaults should be consistent across all encodings.
The examples below may just be bugs. On the other hand, they may be based on some rationale for an encoding. This needs to be then documented with an explanation.
While UTF encodings are “loose” by default, ASCII (and may be others) are strict.
% encoding convertfrom ascii \xc0
unexpected byte sequence starting at index 0: '\xC0'
% codepoints [encoding convertfrom utf-8 \xc0]
U+0000C0
Since \xC0 is invalid for both ASCII and UTF-8, why is the default behavior different?
Similarly, treatment may differ depending on direction (encoding vs decoding). For example, for surrogates,
% encoding convertto utf-8 \udc00
unexpected character at index 0: 'U+00DC00'
% codepoints [encoding convertfrom utf-8 \xed\xb0\x80]
U+00DC00
This is inconsistent. The default encoding and decoding operations should be symmetric.
Another example of inconsistent behavior:
% codepoints [encoding convertfrom -nocomplain ascii A\xE0]
U+000041 U+0000E0
% codepoints [encoding convertfrom -nocomplain utf-8 A\xE0]
U+000041 U+0000E0
% codepoints [encoding convertfrom -nocomplain shiftjis A\xE0]
U+000041
Invalid bytes are by default (in the current implementation) documented as being mapped to their numeric code point equivalents. The output above is as expected for UTF-8 and ASCII but not ShiftJIS. (Perhaps this is just a run of the mill bug in the ShiftJIS encoder?)
Manpages for encoding have errors

The manpages for the encoding command reflect neither the TIPs nor the implementation. Some descriptions and examples need to be corrected even if the current behavior is retained. Hopefully the issues raised above will be addressed and this will be moot.
With respect to I/O, encoding transforms arise in three contexts:

Reading from channels decodes the raw input byte stream into a Tcl string using the encoding configured for the channel.

Conversely, writing to channels encodes Tcl strings into a byte stream.

Certain commands that deal with file names, for example open, exec, glob and file, implicitly use the system encoding to decode and encode file names received from or passed to system calls.
In the case of channels, the encoding used by default depends on the channel type. It can be changed with the fconfigure or chan configure command, which support the following options related to encodings:
The -encoding option specifies the encoding to be used on the channel for both input and output. As a special case, in addition to the encodings accepted by the encodings command, the option can also take the value binary. This is in effect an encoding that maps code points in the range U+0000 to U+00FF to the corresponding integer values and is used with binary strings.
The -translation option, which is primarily used for line ending configuration but can also set the encoding to binary.
The -strictencoding and -nocomplainencoding options, which correspond to the -strict and -nocomplain options of the encoding command and have the same effect.
Use of the -strictencoding and -nocomplainencoding options on a channel has an effect equivalent to their encoding command counterparts (and the same issues) and so does not need further discussion.
The considerations relating to input primarily have to do with how errors are handled for the various combinations of commands (read and gets) and I/O modes (blocking vs non-blocking).

First, a couple of helper procedures to facilitate experimentation.
# Write content to a scratch file as raw bytes, then reopen the file with
# the given channel configuration options and return the channel.
proc getfd {content args} {
set fd [open enctest.tmp wb]
puts -nonewline $fd $content
close $fd
set fd [open enctest.tmp]
if {[llength $args]} {
fconfigure $fd {*}$args
}
return $fd
}
# Read from the channel, close it, and return the decoded code points.
proc encread {fd args} {
try {
set result [read $fd {*}$args]
} finally {
close $fd
}
codepoints $result
}
Blocking read
The read command in blocking mode returns as many characters as were requested, or the entire remaining content if the number of characters is not specified. The current default behavior when the file contents are not valid is shown below.

For Cases 1-3 described for the encoding convertfrom command, the behaviour is as expected and analogous to that command. Invalid bytes are simply mapped to the code point with the same numeric value.
% encread [getfd \xC1\x80 -encoding utf-8]; # Case 1
U+0000C1 U+000080
% encread [getfd \xC0\x80 -encoding utf-8]; # Case 1 - special case
U+000000
% encread [getfd \xED\xA0\x80 -encoding utf-8]; # Case 2 - surrogate
U+00D800
% encread [getfd \xC2\x41 -encoding utf-8]; # Case 3 - invalid trail byte
U+0000C2 U+000041
However, for the fourth case, when the decoded value is outside the code point range, the behavior differs.
% set fd [getfd \x41\x00\x00\x00\x7F\xFF\xFF\x7F\x41 -encoding utf-32le]
file1bdff730a78
% read $fd
A
% read $fd
error reading "file1bdff730a78": illegal byte sequence
Notice the first read returned all characters until the error offset and did not raise an error. It was only the second read that generated the error. This is completely in violation of read semantics.
This behavior extends to all scenarios where an error is to be reported. So the -strictencoding option, which triggers exceptions instead of mapping invalid bytes to their numeric values, also exhibits this.
% set fd [getfd A\x80 -encoding utf-8 -strictencoding 1]
file1bdff736878
% read $fd
A
% read $fd
error reading "file1bdff736878": illegal byte sequence
% close $fd
The case where read is supplied the number of characters to read is very similar.
% set fd [getfd A\xC0BC -encoding utf-8 -strictencoding 1]
file1bdff7400f8
% read $fd 2
A
% eof $fd
0
% read $fd 2
error reading "file1bdff7400f8": illegal byte sequence
% close $fd
A successful blocking read of N characters should never return fewer than N characters except under EOF conditions. As seen above, this is not the case.
Since non-strict behavior is the default for UTF encodings, the -nocomplainencoding option does not seem to have any effect for those cases. For other encodings however, which default to strict mode, the option results in invalid byte sequences being mapped to their numerically equal code points.
% set fd [getfd A\xC1\x80B -encoding ascii -nocomplainencoding 1]
file1443423cc08
% encread $fd
U+000041 U+0000C1 U+000080 U+000042
Non-blocking read
For the default (non-strict) case, non-blocking reads behave similarly to the blocking case, returning the requisite number of characters with invalid byte sequences mapped to their numerically equivalent code points.
The same is true when -strictencoding is specified.
% set fd [getfd A\xC1\x80B -encoding utf-8 -blocking 0 -strictencoding 1]
file14434240608
% read $fd
A
% read $fd
error reading "file14434240608": illegal byte sequence
% close $fd
However, unlike the blocking case, this behavior seems acceptable since a non-blocking read differs from blocking reads in that the semantics permit fewer characters to be returned than requested.
Blocking gets
A successful return from gets is supposed to return all characters up to the next line ending character (ignoring end of file cases). Again, for UTF encodings in default (non-strict) mode, behavior is as expected (as opposed to correct!) with each invocation returning a line with invalid characters mapped to their numeric code points.
% set fd [getfd \x61\x0a\x62\xc1B\x0a\x63\x0a -encoding utf-8 -translation lf]
file28ac6619f28
% codepoints [gets $fd]
U+000061
% codepoints [gets $fd]; # \xc1 mapped to U+00C1
U+000062 U+0000C1 U+000042
% codepoints [gets $fd]
U+000063
% close $fd
With the -strictencoding option turned on, the command generates an error.
% set fd [getfd \x61\x0a\x62\xc1B\x0a\x63\x0a -encoding utf-8 -translation lf -strictencoding 1]
file28ac66112a8
% codepoints [gets $fd]
U+000061
% codepoints [gets $fd]
error reading "file28ac66112a8": illegal byte sequence
TODO: Use of -strictencoding causes a hang in some cases.
% set fd [getfd A\xC1B\nC -encoding utf-8 -translation lf -strictencoding 1]
file23a16832bc8
% gets $fd
...hangs...
This seems like just a bug that needs fixing.
Non-blocking gets
To be written. I had some inconsistent behavior here depending on the location of the invalid bytes. Need further testing to determine whether it was pilot error or pinpoint the different cases.
Channel output has the following error cases to consider.
Case 1: The code point cannot be represented in the target encoding.
Case 2: The code point can be represented in the target encoding but is banned by the rules for the encoding.
Case 3: The code point lies outside the valid code point range.
# Helper to write to a file in an encoding and return contents in hex
proc encwrite {s args} {
set fd [open enctest.tmp w]
try {
if {[llength $args]} {
fconfigure $fd {*}$args
}
puts -nonewline $fd $s
} finally {
close $fd
}
set fd [open enctest.tmp]
try {
fconfigure $fd -encoding binary
set content [read $fd]
} finally {
close $fd
}
return [binary encode hex $content]
}
Case 1 (code point not representable) cannot occur for UTF-{8,16,32} by definition. For other encodings (some or all?) the default operation results in an error being raised.
(Note: encwrite returns the hex representation of output byte sequence)
% encwrite a\u00e9b -encoding ascii; # U+00E9 not supported in ascii
error writing "file204b5574468": illegal byte sequence
Further, -strictencoding makes no difference, as was the case for the encoding convertto command.
% encwrite a\u00e9b -strictencoding 1 -encoding ascii; # U+00E9 not supported in ascii
error writing "file204b5575ce8": illegal byte sequence
Case 2 (code point is representable but must not be) is specific to UTF encodings afaik. In particular, surrogates. An attempt to write surrogates to a UTF-8 encoded channel will fail by default.
% encwrite \uD800 -encoding utf-8
error writing "file28ac6cc9548": illegal byte sequence
Thus, in this case as well, -strictencoding 1 is superfluous as that is the default behavior in any case.
Note this default behavior differs from the channel input handling where by default surrogates are accepted.
However, the -nocomplainencoding option can be used to change this.
% encwrite \uD800 -encoding utf-8 -nocomplainencoding 1
eda080
Handling of Case 3 errors is different again. While for Case 1 an error was raised by default (and thus made -strictencoding superfluous), here the opposite tack is taken. No error is raised by default, and no error is raised in strict mode either. Instead the value is replaced by U+FFFD REPLACEMENT CHARACTER.
% encwrite [teststringobj newunicode 1 0x7fffffff] -encoding utf-32
fdff0000
% encwrite [teststringobj newunicode 1 0x7fffffff] -encoding utf-32 -strictencoding 1
fdff0000
(Here teststringobj newunicode is a hack to create arbitrary code point values via Tcl_NewUnicodeObj.)
There is a special encoding, binary, that can be configured for channels but is not applicable to the encoding command. It is intended to deal with binary strings (described earlier).
When writing to channels configured with the binary encoding, all code points in the argument passed must be in the range U+0000:U+00FF. The channel writes a byte stream where each byte contains the numeric value of the corresponding character in the string.
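For example, writing in-range code points behaves as expected (an illustrative transcript showing the expected result; the scratch file name is arbitrary):

% set fd [open tmp.bin wb]
% fconfigure $fd -encoding binary
% puts -nonewline $fd A\u00FF
% close $fd
% file size tmp.bin
2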
Passing code points outside this range will generate an error irrespective of the values of -strictencoding and -nocomplainencoding.
% set fd [open tmp.bin wb]
file28ac65fc1a8
% fconfigure $fd -encoding binary
% puts $fd \u0100
error writing "file28ac65fc1a8": illegal byte sequence
% fconfigure $fd -nocomplainencoding 1
% puts $fd \u0100
error writing "file28ac65fc1a8": illegal byte sequence
% close $fd
This differs from Tcl 8 behavior where, instead of generating an error, Tcl would simply ignore the higher bits of the numeric value.
There are a number of Tcl commands where encoding transforms are used implicitly when calling system APIs. These include glob, open, exec, file and the parsing of the command line or environment variables. The transform may require either encoding (passing paths to the system API) or decoding (receiving paths from the system API). In both cases, Tcl assumes the system encoding is in effect.
There is a problem with the above as discussed in the issues.
Issues specific to channels and system interfaces are described below. The issues related to encoding transforms described earlier also apply to transforms in I/O, with input and output being analogous to convertfrom and convertto respectively. Those issues are not repeated here.
A blocking read should never read fewer bytes than requested (or whole file) on a successful return. This is violated in the presence of invalid input bytes.
This has already been described earlier with an example.
See the mailing list thread for an ongoing discussion.
Related to the above is the question of the channel state when an exception is raised on a read or gets due to invalid input bytes.

My opinion is that when an exception is raised, the read position should not be changed from the value at the time the read or gets was invoked. This allows the application to then turn off the strict checking (if so desired) and read in the rest of the data. For example,
% set fd [open x.txt r]
file1de34c88158
% fconfigure $fd -encoding utf-8 -strictencoding 1
% gets $fd
a
% gets $fd
error reading "file1de34c88158": illegal byte sequence
% fconfigure $fd -encoding utf-8 -strictencoding 0
% gets $fd
bÀ
% gets $fd
c
An alternate suggestion made in the mailing list was to move the read position to the location of the invalid byte and return the characters successfully decoded so far in the error options dictionary. My personal opinion is returning successfully read data in the error options is unnatural and non-idiomatic.
For further explanation of the above recommended behavior and alternatives see the mailing list post and the containing thread.
Yet another alternative on encountering invalid input would be to raise an error and then prohibit any further read operations on the channel.
If an exception is raised on a channel because the channel encoding does not support the characters being written, it is not documented whether the characters up to the error location are written to the channel or not. Experimentally discovering this is difficult as it may depend on the buffering model for the channel as well as the underlying channel driver.

My preference would be that a write (puts) should successfully accept all the data or none.
Tcl’s assumption of the system encoding in system interfaces and file paths has some potential pitfalls. Unix systems permit any arbitrary sequence of bytes to be a file name. This means that irrespective of the encoding used, there may be invalid byte sequences within the name. The currently implemented Tcl 9 behavior of mapping invalid bytes to their numerically equivalent code points can result in silent misbehavior such as renaming of files.
A command like
foreach f [glob *] {rename $f $f.bak}
will not behave as expected if any matched file names are an arbitrary sequence of bytes as permitted by Unix file systems.
Similarly, a command like
set ::env(X) $::env(X)
will not be a no-op.
Fixing this requires adding support for a lossless encoding profile as listed in an earlier issue and modifying the relevant commands such as glob, open etc. to implicitly use that profile. The same would also apply to command line arguments, environment variables etc.
The -nocomplainencoding and -strictencoding options are in conflict but no error is raised if they are passed together to fconfigure. This is probably just a bug.
© 2022 Ashok P. Nadkarni