Tcl 9 has significantly expanded support for Unicode characters compared to Tcl 8. The changes are documented across multiple TIPs and manpages, which makes it difficult for some people (meaning me) to get their head around the functionality. This document will eventually consolidate Tcl 9 Unicode related support in one place.
However, at the moment its purpose is to document how the current implementation works and highlight what many of us consider major deficiencies in the API. Although the debate so far has centered around channel I/O and strictness related options, there are other issues, some of which might just be bugs while others are more substantial.
The hope is this will allow folks not following the debates on the mailing list and elsewhere to catch up and express their views.
Note: this area is still under debate and development on multiple branches. What follows is the behavior of the commit fb44cd608e43667eef4beeefa87b81a775470666 on the main branch.
The following definitions from Chapter 3 (PDF) of the Unicode standard are relevant for the discussion below.
D9 Unicode codespace: A range of integers from 0 to 0x10FFFF.
D10 Code point: Any value in the Unicode codespace.
D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
D12 Coded character sequence: An ordered sequence of one or more code points. Note here the word coded does not refer to encoding transforms like UTF-8 but rather to the mapping of abstract characters to integer code points.
D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
D78 Code unit sequence: An ordered sequence of one or more code units.
D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
Conformant error handling of invalid encoded byte sequences is defined in 3.2 C10, 3.2 D93 and 5.22. These should either raise an error or replace the sequence with the appropriate (defined there) number of U+FFFD REPLACEMENT CHARACTER code points. In addition, as a special case for handling file names and similar which require lossless conversion, Unicode TR #36 3.7 recommends several alternatives including PEP 383.
In Tcl 9, a string at the script level seems to correspond to a sequence of Unicode code points as defined in the standard. In particular,
It is not a sequence of abstract characters as defined in the standard as it allows for code points that must not be interpreted as characters.
It is not a sequence of glyphs or graphemes in the manner a human reader might recognize as characters.
It is not a sequence of Unicode scalar values as the latter does not include values in the high/low surrogate range.
It is not a Unicode string as defined in the standard as that is defined in terms of encoding forms which in turn are defined in terms of Unicode scalar values.
However, while at the script level strings can only be constructed as a sequence of Unicode code points in the range U+0:U+10FFFF, at the C level Tcl_NewUnicodeObj etc. allow a Tcl string to contain values outside that range. It is not clear if this is legal or whether no check is made because of the performance cost. This leads to inconsistent internal Tcl_Obj structures.
There are multiple ways to include non-ASCII code points in string literals in ASCII program text. With X denoting a hexadecimal digit,

\xXX for code points in the range 0-0xFF.
\uXXXX for code points in the range 0-0xFFFF.
\UXXXXXXXX for code points in the range 0-0x10FFFF.

Note that as the documentation of the \U form states, up to eight hex digits may be specified but the parsing of digits stops if the resulting value would exceed 0x10FFFF.
So for example,
% scan \U10FFFFFF %c%c%c
1114111 70 70
It is not possible to generate code points above 0x10FFFF using this notation. That is assumed intentional as values in that range are not actually valid code points.
Binary strings in Tcl 9 are simply coded character sequences (Tcl strings) where each code point is in the range 0-255. The Tcl binary command operates on such strings as though they were binary data (a sequence of bytes).

Note there is a change from Tcl 8 in the behavior of the binary command when its operand contains code points above 255. In Tcl 8, the higher order bits would be ignored.
% package require Tcl
8.6.13
% binary encode hex \UFF
ff
% binary encode hex \U100
00
In Tcl 9, an error is raised.
% package require Tcl
9.0a4
% binary encode hex \UFF
ff
% binary encode hex \U100
expected byte sequence but character 0 was 'Ā' (U+000100)
I have not found any manpage or TIP that documents what constitutes a Tcl string. As discussed earlier, it appears from behavior that a Tcl string is a sequence of code points but that should be explicitly documented else it leads to confusion as to whether surrogate code points may be present in the string, whether out of range code points are allowed etc. This leads to the issue described in the following section.
On a related but lesser note, the use of the term Unicode string in the Tcl manpages is not even remotely similar to the definition of Unicode string in the Unicode standard (see earlier).
My preference would be to define strings as sequences of Unicode code points and change all mentions of Unicode strings to just simply strings.
Along the same lines, the Tcl man pages use the term character when they should really use code point. I’m ambivalent as to whether this should be changed to be more accurate, as character can be interpreted in multiple ways, none of which match what Tcl considers a character. On the other hand, for most readers the term character is more natural.
Irrespective of whether values outside U+0:U+10FFFF are legal or not, it is important that a Tcl_Obj object be consistent in its internal structure. Currently this is not the case. Values outside that range (for example 0x7FFFFFFF) result in a Tcl_Obj whose bytes field contains the UTF-8 sequence \xef\xbf\xbd (corresponding to U+FFFD) while the structure’s internal representation String.unicode contains the original value (0x7FFFFFFF). This inconsistency implies the result of operations on that object would depend on whether the implementation used the byte representation or the internal representation. This is not a good thing. The following illustrates the potential for anomalous behavior (teststringobj newunicode maps to Tcl_NewUnicodeObj for testing purposes):
% set u \uFFFD
% set c [teststringobj newunicode 1 0x7fffff7f]
% regexp $u $c
0
% string equal $u $c
1
The above is because regexp works with the String.unicode internal representation while string equal uses the bytes field.
Furthermore, Tcl internally combines flags with the code point values. For example, the value 0x1000001 passed in via Tcl_NewUnicodeObj will be interpreted as the flag TCL_COMBINE (defined as 0x1000000) or-ed with the code point U+0001 when passed to internal Tcl encoding functions. This misinterpretation, confusing data bits passed in as internal flags, also makes me very uncomfortable though I’m not sure what the ramifications are.
Assuming integer values above 0x10FFFF are in fact illegal, fixing this by adding checks to the C API and returning errors does not seem feasible:
There would be a performance hit for large strings.
Even if the performance hit was tolerable, semantics of some of the C APIs do not allow for failures and changing that would seriously break compatibility.
Alternatives are to:

Change Tcl string semantics to allow for code points above U+10FFFF. Tcl’s internal UTF-8 encoding would need to change to allow for 6 byte UTF-8 sequences. Note this does not mean Tcl’s external encoders would treat these code points as legal.

Alternatively, replace the out of range code points with U+FFFD REPLACEMENT CHARACTER and document the C API accordingly. This is in effect what the current implementation does, but only partially as described above. The implementation would have to change to fix up the String representation as well. This could be done either at the time the C APIs are called, or if that is considered too detrimental to performance, lazily when the values are encountered in string operations that use the String.unicode representation.
The Tcl manpage states
The range U+00D800–U+00DFFF is reserved for surrogates, which are illegal on their own. Therefore, such sequences will result in the replacement character U+FFFD.
This is not how the implementation behaves.
% format %x [scan \U00D800 %c]
d800
My assumption is the manpage is wrong and needs to be corrected as Tcl strings permit inclusion of values in the surrogate range.
The \U and \u escape sequences have variable lengths. This is probably by design as in Tcl 8 and not likely to be changed. I am listing it here because I find it confusing from a readability perspective. As an example, guess the result of

string length \UABCDEF
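Based on the parsing rule quoted earlier, the escape should consume only \UABCDE (a valid code point) and leave a trailing literal F, so the expected result is:

% string length \UABCDEF
2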
As a general rule, Tcl’s string related commands (string, regexp) are “Unicode-aware”. However, they operate on sequences of zero or more code points, i.e. coded character sequences as defined above and in the Unicode standard. Tcl does not operate on abstract characters as defined in the Unicode standard, or on glyphs or graphemes, which are what humans would recognize as characters.
Two coded character sequences that represent the same abstract character will not be treated as equal. For example, the abstract character e with acute may be represented by the precomposed code point U+00E9 (Latin Small Letter E with Acute) or as the decomposed form U+0065, U+0301 (Latin Small Letter E followed by Combining Acute Accent). These are treated as different strings in Tcl even though they represent the same grapheme and would be considered the same by a human, assuming a competent display driver.
This behavior is reflected through all Tcl commands and leads to what might seem to be anomalous behavior. The string length command would return 2 for the length of \u0065\u0301 though the display would only show a single character. Similarly, string index would return U+0065 (e) while the user may expect to see é based on what is seen in the wish console.
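To illustrate (expected results given the code point semantics described above):

% string length \u00e9
1
% string length \u0065\u0301
2
% string equal \u00e9 \u0065\u0301
0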
String comparisons and sorting are done by comparing the numeric values of the code points in each coded character sequence, not by using any locale information. Thus when sorting strings, the letter f would sort before the precomposed character U+00E9 (\u00e9) but after the equivalent U+0065, U+0301 sequence. This can lead to visually surprising results, for example, when displaying a sorted list of file names.
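To illustrate, sorting the three strings and printing the first code point of each element of the result is expected to show the decomposed form sorting before f and the precomposed form after it:

% lmap s [lsort [list f \u00e9 e\u0301]] {format %04X [scan [string index $s 0] %c]}
0065 0066 00E9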
TODO Does the lsort -dictionary option understand character case and digits outside of ASCII?
Tcl’s string is command has several character class checks, such as space and digit. These are Unicode aware. For example, for U+0967, the digit 1 in the Devanagari script,
% string is digit \u0967
1
There is one new classification command in Tcl 9, string is unicode. Experimentally, it appears that the term refers to any code point other than those in the surrogate and noncharacter categories as defined in the Unicode standard.
% string is unicode \uD800; # Surrogates
0
% string is unicode \uFDD0; # Noncharacter code point
0
% string is unicode \uE000; # Private use
1
% string is unicode \UE4000; # Unassigned/reserved
1
% string is unicode \u001F; # Control
1
% string is unicode \u0020; # Graphic
1
% string is unicode \u200E; # Format
1
Note that string is unicode cannot be used to check for abstract characters as it returns 1 for unassigned code points, which are not to be treated as abstract characters per the standard.
string is unicode
There are a couple of issues with the new string is unicode command, apart from the fact that it is not currently documented.
The name of the command is ambiguous and therefore confusing. Does unicode refer to Unicode code points, Unicode scalar values, Unicode strings, or what? Experimentally it appears, as shown above, that it refers to a subset of Unicode character categories. This is not at all clear from the name.
Perhaps the command string is unicode should be renamed to string is abstractchar or string is char.
Further, it is not very clear where this command is useful. It has been suggested that it can be used to check whether a Tcl string value can be conformantly transformed via a UTF encoding for transmission. However, this use does not hold because transmission of noncharacter code points is explicitly permitted by the standard.
TIP 652 discusses this in further detail and suggests changing the command to correspond to the categories as defined in the standard. There has been no discussion on the TIP as yet.
The Unicode standard explicitly warns against interpretation of code points in the Surrogate and Noncharacter categories as characters when working with sequences of characters. The string commands operate on code points and violate this.
Given that Tcl strings have been implicitly defined as sequences of code points and not characters, it is not clear much can be done about this other than documenting that Tcl strings are not strings of characters as defined in the Unicode standard.
When exchanging data between processes, via file, network etc., Tcl string values have to be transformed to and from a sequence of bytes using some encoding. This transform may be done either explicitly with the encoding command or as part of I/O by configuring a channel with the -encoding option. This section only discusses the former. I/O is discussed in a later section.
Relevant definitions from the Unicode standard are
D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
D78 Code unit sequence: An ordered sequence of one or more code units.
D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence. Note the reference to scalar value which means surrogate code points should never be transformed to any of the UTF encoding forms. (They can appear in a UTF-16 stream as a result of transforming a code point outside the BMP).
The standard defines the Unicode encoding forms UTF-8, UTF-16 and UTF-32. Note however that Tcl supports many “traditional” encodings like cp1252 as well, and any discussion needs to include these.
The major discussion point with respect to encoding transforms is dealing with error cases:
defining what constitutes an error. This may depend on the command options in effect.
how errors are reported or handled.
The type of errors encountered depends on the operation - encoding versus decoding.
NOTE: Most of the discussion below is related to UTF encoders unless stated otherwise.
The following errors may be encountered when decoding an encoded byte stream:
Case 1. The byte value may be one that should never appear in the specified encoding or at a particular position in a multibyte encoding. For example, the values \xC0 and \xC1 should never appear at any point in a UTF-8 encoded byte sequence. As an example of the latter, the byte \x80 (amongst others) should never appear as the lead byte of a multibyte sequence in ShiftJIS.
Case 2. The rules for the encoding do not permit the value to have been encoded in the first place. For example, surrogate Unicode code points should never be encoded and thus should be treated as an error when encountered during a decoding operation. (Note the surrogate could appear in the UTF-16 encoded byte sequence. But the decoded value should never be a surrogate code point.)
Case 3. A byte subsequence within a byte sequence that is encoded with a multibyte encoding terminates prematurely. This may or may not be an error depending on whether the subsequence is in the middle of the containing byte sequence or at the end. In the latter case, it may just mean more bytes are needed, as may happen when data is read over a streaming interface. For example, the UTF-8 sequence \xC2\x41 is a hard error as there is no trailing byte succeeding the lead byte \xC2 (\x41 cannot be a trailing byte). On the other hand, the sequence \x41\xC2 may not be an error because additional data may arrive containing a valid trailing byte to complete the \xC2.
Case 4. The decoded values may lie outside the range of Unicode code points. For example, the UTF-32 encoded sequence \x7F\xFF\xFF\x7F trivially translates to the integer value U+7FFFFF7F which is greater than the largest valid code point U+10FFFF. This is distinguished from Case 2 because it is treated differently by Tcl.
How the various error cases above are handled by the encoding convertfrom command depends on several options.
By default, when the encoding is one of the UTF encoders, the command will not raise an error for any case except Case 4 above.
If an invalid byte or byte sequence is detected, it is simply transformed to the Unicode code point with the same integer value. In the examples below, \xC0 and \xC1 are invalid at any position in a UTF-8 byte sequence while \x80 is valid only as a trailing byte preceded by a valid lead byte.
Surrogates are happily accepted, though again they should not appear in encoded form in UTF-8 streams.
Incomplete encoded sequences are treated similarly to the first case. A byte containing \xC2 is a lead byte and should have a valid trailing byte following it.
The only time an error is raised with the default behavior is when the decoded value lies outside the range of Unicode code points.
Note the above constitutes non-conformant behavior.
Examples of the above error cases:
proc codepoints {s} {
    join [lmap c [split $s ""] {
        string cat U+ [format %.6X [scan $c %c]]
    }]
}
% codepoints [encoding convertfrom utf-8 \xC1\x80]; # Case 1
U+0000C1 U+000080
% codepoints [encoding convertfrom utf-8 \xC0\x81]; # Case 1
U+0000C0 U+000081
% codepoints [encoding convertfrom utf-8 \xC0\x80]; # Case 1 - Special handling!
U+000000
% codepoints [encoding convertfrom utf-8 \xed\xa0\x80]; # Case 2 - surrogate
U+00D800
% codepoints [encoding convertfrom utf-8 \xC2\x81]; # Case 3 - \xC2 followed by valid trail byte (valid UTF-8)
U+000081
% codepoints [encoding convertfrom utf-8 \xC2\x41]; # Case 3 - \xC2 followed by invalid trail byte
U+0000C2 U+000041
% codepoints [encoding convertfrom utf-8 \xC2]; # Case 3 - \xC2 with nothing following
U+0000C2
% encoding convertfrom utf-32 \x7f\xff\xff\x7f; # Case 4 raises an error
unexpected byte sequence starting at index 0: '\x7F'
Note the special treatment specifically for the pair \xC0\x80.
This default behavior not only violates the Unicode standard but also results in unexpected data modification.
-strict option
Strict conformance to the Unicode standard can be enforced with the -strict option. In this case all the error cases raise exceptions.
% encoding convertfrom -strict utf-8 \xC0\x81; # Case 1
unexpected byte sequence starting at index 0: '\xC0'
% encoding convertfrom -strict utf-8 \xC0\x81; # No special handling for \xC0\x80
unexpected byte sequence starting at index 0: '\xC0'
% encoding convertfrom -strict utf-8 \xed\xa0\x80; # Case 2 - surrogate
unexpected byte sequence starting at index 0: '\xED'
% encoding convertfrom -strict utf-8 \xC2\x41; # Case 3 - invalid trail byte
unexpected byte sequence starting at index 0: '\xC2'
% encoding convertfrom -strict utf-8 \xC2; # Case 3 - premature termination
unexpected byte sequence starting at index 0: '\xC2'
% encoding convertfrom -strict utf-32 \x7f\xff\xff\x7f; # Case 4 raises an error as before
unexpected byte sequence starting at index 0: '\x7F'
As far as I know, use of -strict makes the encoding convertfrom command conformant with the Unicode standard.
-nocomplain option
The final option that affects whether invalidly encoded bytes are treated as errors is -nocomplain. This is the opposite of -strict in the sense that it never raises an exception. In particular, it differs from the default behavior in its handling of Case 4 above. When the decoded value is outside the Unicode code point range, instead of raising an error, it replaces the value with U+FFFD REPLACEMENT CHARACTER as defined in the Unicode standard.
% codepoints [encoding convertfrom -nocomplain utf-32 \x7f\xff\xff\x7f]
U+00FFFD
-failindex option
The -failindex option differs from the -strict and -nocomplain options in that it does not change what is considered an error but rather changes the error reporting mechanism. When this option is specified, on detecting an error encoding convertfrom will return the string successfully decoded up to the failing offset. Additionally, the failing offset will be returned in the variable passed as the -failindex option value.
% encoding convertfrom utf-32le \x41\x00\x00\x00\x7f\xff\xff\x7f
unexpected byte sequence starting at index 4: '\x7F'
% encoding convertfrom -failindex fidx utf-32le \x41\x00\x00\x00\x7f\xff\xff\x7f
A
% set fidx
4
The use case for this option is incremental decoding of binary data that is fragmented (for example coming over a socket). A toy example would be
% foreach part [list \x41\x00\x00\x00\x42 \x00\x00\x00] {
    append remaining $part
    puts [encoding convertfrom -failindex fidx utf-32le $remaining]
    if {$fidx == -1} {
        set remaining ""
    } else {
        set remaining [string range $remaining $fidx end]
    }
}
A
B
Use of this option as above has a significant issue in that there is no distinction between a failure due to an invalid byte value (hard error) versus a failure because the complete binary stream is not available (more data needed).
Going in the other direction, converting Tcl strings to an encoded byte sequence also has potential for errors.
Case 1. The encoding does not support the Unicode code point. For example, code points higher than U+00FF are not supported in the ASCII encoding.
Case 2. The encoding may be able to encode a Unicode code point but the rules for the encoding do not allow it. For example, the Unicode standard for UTF-8 encoding prohibits encoding of surrogate code points. So although the surrogate U+DC00 can be encoded as the byte sequence \xED\xB0\x80, it is prohibited by the standard.
Case 3: The value of the code point lies outside the valid code point range.
Tcl strings are transformed to binary strings (byte sequences) with the encoding convertto command.

By default the command will raise an error for Cases 1 and 2 above.
% encoding convertto ascii A\u00e9B; # Case 1
unexpected character at index 1: 'U+0000E9'
% encoding convertto ascii \uDC00; # Case 2
unexpected character at index 0: 'U+00DC00'
% encoding convertto utf-8 \uDC00; # Case 2
unexpected character at index 0: 'U+00DC00'
However, no error is raised for Case 3 for UTF encodings (only!). U+FFFD is substituted instead.
% binary encode hex [encoding convertto ascii A[teststringobj newunicode 1 0x7fffff7f]B]; # Case 3
unexpected character at index 1: 'U+00FFFD'
% binary encode hex [encoding convertto utf-8 A[teststringobj newunicode 1 0x7fffff7f]B]; # Case 3
41efbfbd42
It is not clear whether these discrepancies between Cases 1/2 and Case 3, as well as between ASCII and UTF-8 for Case 3, are intentional.
It is not clear why the encoding convertto and encoding convertfrom commands differ in that the former raises an error by default and the latter does not.
-strict option
Although the encoding convertto command has the -strict option like encoding convertfrom, it is not clear what effect it has. It seems to make no difference to the default behavior.
% encoding convertto -strict ascii A\u00e9B; # Case 1
unexpected character at index 1: 'U+0000E9'
% encoding convertto -strict ascii \uDC00; # Case 2
unexpected character at index 0: 'U+00DC00'
% encoding convertto -strict utf-8 \uDC00; # Case 2
unexpected character at index 0: 'U+00DC00'
-nocomplain option
The -nocomplain option turns off all raising of exceptions. For ASCII, and probably all non-UTF encodings, this results in the code point being replaced by an encoding-specific character, generally the question mark.
% encoding convertto -nocomplain ascii A\xe9B
A?B
% encoding convertto -nocomplain ascii A\uDC00B
A?B
% encoding convertto -nocomplain ascii A[teststringobj newunicode 1 0x7fffff7f]B
A?B
% encoding convertto -nocomplain shiftjis A[teststringobj newunicode 1 0x7fffff7f]B
A?B
For UTF encodings, behavior is different. For Case 2 (Case 1 is not possible) the surrogate is output in encoded form. For Case 3, the Unicode U+FFFD REPLACEMENT CHARACTER is used to replace the invalid code point as for the other encodings.
% binary encode hex [encoding convertto -nocomplain utf-8 A\uDC00B]; # Case 2
41edb08042
% binary encode hex [encoding convertto -nocomplain utf-32 A\uDC00B]; # Case 2
4100000000dc000042000000
% binary encode hex [encoding convertto -nocomplain utf-8 A[teststringobj newunicode 1 0x7fffff7f]B]; # Case 3
41efbfbd42
% binary encode hex [encoding convertto -nocomplain utf-32 A[teststringobj newunicode 1 0x7fffff7f]B]; # Case 3
41000000fdff000042000000
The resulting UTF byte sequence will be conformant for Case 3, but not Case 2.
-failindex option
With the -failindex option specified, encoding convertto encodes as much as it can and returns the index of the character that could not be encoded.
% list [encoding convertto -failindex fidx ascii A\xe9B] $fidx
A 1
% list [encoding convertto -failindex fidx utf-8 A\uDC00B] $fidx
A 1
The Unicode standard specifies conforming behaviors for handling invalid byte sequences. Among these the only one supported by Tcl 9 currently is raising of an error exception.
The alternative of replacing invalid bytes with the U+FFFD REPLACEMENT CHARACTER as defined in the standard is not available. This is common and useful functionality expected by many internationalized applications and available in practically all other languages.
The encoding convertfrom command should support the above. The encoding convertto command may also need it to deal with the case where an attempt is made to encode code points (surrogates, for example) that are invalid in the target UTF encoding.
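The missing replace behavior can, in principle, be emulated in script on top of the current -failindex reporting. The sketch below is purely illustrative: the decode_with_replace helper is hypothetical, it assumes -strict and -failindex can be combined, and it substitutes one U+FFFD per failing byte rather than following the standard’s exact substitution rules.

# Hypothetical helper (not part of Tcl): emulate a "replace" policy using
# -failindex. Each failure position is replaced with U+FFFD and decoding
# resumes one byte later.
proc decode_with_replace {enc bytes} {
    set out ""
    while {[string length $bytes] > 0} {
        append out [encoding convertfrom -strict -failindex idx $enc $bytes]
        if {$idx < 0} {
            break
        }
        append out \uFFFD
        set bytes [string range $bytes [expr {$idx + 1}] end]
    }
    return $out
}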
Earlier, several different cases were listed for invalid byte sequences encountered by the encoding conversion commands. Handling these has two aspects:
Specifying which cases are treated as errors and which are handled with a fallback strategy such as replacing with some fixed or mapped character.
Once a case is to be treated as an error, how it is to be reported to the caller.
The first is controlled through the -strict and -nocomplain options while the latter is handled by -failindex.
The use of two separate options, -strict and -nocomplain, leads to some confusion in terms of semantics when the options are used in combination.
It is also the case that for encoding convertto, the -strict option does not seem to have any effect as the default behavior (in contrast to convertfrom) already has strict semantics.
More important is that adding new handlers, such as character replacement as specified in the Unicode standard or lossless handling, means adding more options. Although it may be possible to add options such as -replace and -lossless, it would be a cleaner interface to have a single option whose value determines the behavior.
TIP 654 proposes profiles which serve this purpose. Specifying the error handling would then take the form
encoding convertfrom -profile strict utf-8 $s
encoding convertfrom -profile lossless utf-8 $s
encoding convertfrom -profile replace utf-8 $s
and so on. TIP 654 still needs to be fleshed out but its model seems to me a better interface than continuing to add mutually exclusive options to control classification of error cases.
A related issue with these options is that they are treated inconsistently depending on the conversion direction and the encoding in use. This is detailed in other sections.
By default, the UTF decoders map invalid bytes to their numerically equal code points. Effectively they assume the encoding is iso8859-1 (or cp1252?).
This is just plain wrong and saying it works that way in Tcl 8 does not make it any less so. It also happens to be non-conformant.
As an example of undesirable consequence, consider the treatment of ZIP archives as discussed on the mailing list. ZIP archives do not contain metadata indicating the encoding used for file paths stored in the archive. Handling of these paths is important in at least two contexts:
They should be displayed correctly to the user or at least an indication that some characters could not be decoded
If the file name is written out (say only the content is changed), the original name should be preserved.
Consider the single byte file name \xE0 created on a CP1250 system, where it corresponds to the character LATIN SMALL LETTER R WITH ACUTE (U+0155). When Tcl reads it with a (guessed) UTF-8 encoding in default mode, a standalone \xE0 is invalid so it is passed through as its numerically equivalent code point (U+00E0), resulting in the file name within Tcl being LATIN SMALL LETTER A WITH GRAVE. This is how it will be displayed to the user and how the file will be renamed when written out.
This silent modification of data that is invisible to the application and user is unacceptable in my mind.
For a discussion on this issue, see the mailing list post.
A related issue is that this behavior is not even consistent between encodings as described separately later.
There are specific circumstances where the encoding / decoding are required to support lossless operation even in the presence of invalid bytes. The requirement can be stated as
encode(decode(x)) == x
Neither Tcl’s non-conformant default replacement with numeric equivalent nor the Unicode standard’s replace with U+FFFD behaviors meet this requirement.
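To illustrate the round-trip failure with the current default behavior, the expected result (given the decoder’s treatment of dangling lead bytes shown earlier) is that the bytes \x41\xE0 do not come back unchanged:

% binary encode hex [encoding convertto utf-8 [encoding convertfrom utf-8 \x41\xE0]]
41c3a0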
Situations where this is important include dealing with file names (which may be any sequence of bytes on Unix) and system interfaces. This is elaborated in Unicode TR #36 3.7 which specifies several acceptable alternatives for achieving this one of which is Python’s PEP 383.
Quoting: “With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF.”
See the PEP and Option D in Substituting malformed UTF-8 sequences in a decoder for why this is viable.
The Unicode Tech report #36 also semi-blesses this, again with the same caveat that use should be restricted. It also suggests other alternatives for the same purpose.
Note there are some associated caveats as to where this handling is appropriate and would need to be followed.
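As a purely illustrative sketch (these helper procedures are hypothetical and not part of Tcl or the PEP), the byte-to-surrogate mapping and its inverse are simply:

# Hypothetical illustration of the PEP 383 idea: an undecodable byte b in
# the range 0x80-0xFF decodes to the lone surrogate U+DC00+b, and such
# surrogates map back to the original byte on encoding, so the round trip
# encode(decode(x)) == x holds.
proc pep383_escape_byte {b} { expr {0xDC00 + $b} }
proc pep383_unescape_codepoint {cp} { expr {$cp - 0xDC00} }

% format %X [pep383_escape_byte 0xE0]
DCE0
% format %X [pep383_unescape_codepoint 0xDCE0]
E0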
Fixing this issue requires an option to encoding that specifies this handling of invalid encoded input. Python uses -surrogateescape but if a different mechanism is chosen, something like -lossless might be appropriate. If the suggestions related to reworking the -strict and -nocomplain options are accepted, a lossless profile could be implemented.
The default behavior of the UTF encoders is to not force strict conformance. Instead they effectively assume the encoding of the invalid bytes to be ISO8859-1.
This is wrong. Tcl should not be in the business of “guessing” what the encoding was supposed to be unless TIP 131 is implemented.
Loose, “forgiving” behavior leads to latent bugs, security holes etc. and should be explicitly requested by the application, not enabled by default.
The common push back is that this allows better Tcl 8 compatibility and changing it would break applications. That breakage is a good thing as it forces those applications to be modified to be correct and robust. Moreover, be it noted that the binary encoding already breaks Tcl 8 compatibility (intentionally and with good reason).
Simply put, strict conformance should be the default.
Note in passing that the manpage does not reflect the current behavior while the TIPs are conflicting. TIP 601 specifies strictness as the default (as seems to be the case for ascii, for example) while TIP 346 implies it is not (else why have the -strict option?).
-failindex does not distinguish errors from incomplete sequences

The -failindex option to encoding convertfrom is intended to allow incremental decoding of a byte sequence as described earlier. However, it does not provide a mechanism to distinguish a hard error (invalid bytes) from soft errors (more data needed for incomplete sequences). This makes it insufficient for use.
Consider trying to decode A\xC2 versus A\xC0. Both will return “A” with a failindex of 1. The former may be decoded successfully through the arrival of more data that is appended. Adding more data to the latter is never going to help; it will continue to fail. But there is no simple way to determine which case applies.
One way to fix this would be to support another option, -failcode, whose value would be a variable that holds the reason for the failure when the -failindex option variable indicates failure. This would allow distinguishing hard errors from soft errors as in the following pseudocode.
# Pseudocode: -failcode is a hypothetical option, not present in Tcl
while {...more data available...} {
    append bindata [get more data]
    set decoded [encoding convertfrom -failindex fidx -failcode fcode \
                     $encoding $bindata]
    if {$fidx != -1} {
        if {$fcode ne "NEEDMOREDATA"} {
            # Hard error
            error "Incoming data is not encoded correctly"
        }
        # Not really an error, just need more data
        set bindata [string range $bindata $fidx end]
    } else {
        set bindata ""
    }
    # Do something with the decoded data
    puts $decoded
}
Alternatively, instead of another option, the -failindex semantics could be modified to return a pair containing the failing index and a failure code.
This is further discussed in the mailing list thread. One of the posts there suggests an alternative by looping while continuously trying to add data. Aside from the performance cost, in my opinion this is not straightforward for a programmer to have to write.
The default behavior of the encoding conversion commands and their options is not consistent amongst different encodings. It is reasonable that encodings differ in what they consider as “strict”. However,
The programmer should not be surprised by what an encoding chooses to define as strict or loose.
The notion of strictness for an encoding should be uniformly applied.
The defaults should be consistent across all encodings.
The examples below may just be bugs. On the other hand, they may be based on some rationale for an encoding. This needs to be then documented with an explanation.
While UTF encodings are “loose” by default, ASCII (and may be others) are strict.
% encoding convertfrom ascii \xc0
unexpected byte sequence starting at index 0: '\xC0'
% codepoints [encoding convertfrom utf-8 \xc0]
U+0000C0
Since \xC0 is invalid for both ASCII and UTF-8, why is the default behavior different?
Similarly, treatment may differ depending on direction (encoding vs decoding). For example, for surrogates,
% encoding convertto utf-8 \udc00
unexpected character at index 0: 'U+00DC00'
% codepoints [encoding convertfrom utf-8 \xed\xb0\x80]
U+00DC00
This is inconsistent. The default encoding and decoding operations should be symmetric.
Another example of inconsistent behavior:
% codepoints [encoding convertfrom -nocomplain ascii A\xE0]
U+000041 U+0000E0
% codepoints [encoding convertfrom -nocomplain utf-8 A\xE0]
U+000041 U+0000E0
% codepoints [encoding convertfrom -nocomplain shiftjis A\xE0]
U+000041
Invalid bytes are by default (in the current implementation) documented as being mapped to their numeric code point equivalents. The output above is as expected for UTF-8 and ASCII but not ShiftJIS. (Perhaps this is just a run of the mill bug in the ShiftJIS encoder?)
Manpages for encoding have errors

The manpages for the encoding command reflect neither the TIPs nor the implementation. Some descriptions and examples need to be corrected even if the current behavior is retained. Hopefully the issues raised above will be addressed and this will be moot.
With respect to I/O, encoding transforms arise in three contexts:

Reading from channels decodes the raw input byte stream into a Tcl string using the encoding configured for the channel.

Conversely, writing to channels encodes Tcl strings into a byte stream.

Certain commands that deal with file names, for example open, exec, glob and file, implicitly use the system encoding to decode and encode file names received from or passed to system calls.
In the case of channels, the encoding used by default depends on the channel type. It can be changed with the fconfigure or chan configure command, which support the following options related to encodings:
The -encoding option specifies the encoding to be used on the channel for both input and output. As a special case, in addition to the encodings accepted by the encodings command, the option can also take the value binary. This is in effect an encoding that maps code points in the range U+0000 to U+00FF to the corresponding integer values and is used with binary strings.
The -translation option, which is primarily used for line ending configuration but can also set the encoding to binary.
The -strictencoding and -nocomplainencoding options, which correspond to the -strict and -nocomplain options of the encoding command and have the same effect.
Use of the -strictencoding and -nocomplainencoding options on a channel has an effect equivalent to their encoding command counterparts (and the same issues) and so does not need further discussion.
The considerations relating to input primarily have to do with how errors are handled for the various combinations of commands (read and gets) and I/O modes (blocking vs non-blocking).

First, a couple of helper procedures to facilitate experimentation.
# Write content to a scratch file as raw bytes, then reopen the file with
# the given channel configuration options and return the channel.
proc getfd {content args} {
set fd [open enctest.tmp wb]
puts -nonewline $fd $content
close $fd
set fd [open enctest.tmp]
if {[llength $args]} {
fconfigure $fd {*}$args
}
return $fd
}
# Read from the channel, close it, and return the decoded code points.
proc encread {fd args} {
try {
set result [read $fd {*}$args]
} finally {
close $fd
}
codepoints $result
}
Blocking read
The read command in blocking mode returns as many characters as were requested, or the entire remaining content if the number of characters is not specified. The current default behavior when the file contents are not valid is shown below.

For Cases 1-3 described for the encoding convertfrom command, the behaviour is as expected and analogous to that command. Invalid bytes are simply mapped to the code point with the same numeric value.
% encread [getfd \xC1\x80 -encoding utf-8]; # Case 1
U+0000C1 U+000080
% encread [getfd \xC0\x80 -encoding utf-8]; # Case 1 - special case
U+000000
% encread [getfd \xED\xA0\x80 -encoding utf-8]; # Case 2 - surrogate
U+00D800
% encread [getfd \xC2\x41 -encoding utf-8]; # Case 3 - invalid trail byte
U+0000C2 U+000041
However, for the fourth case, when the decoded value is outside the code point range, the behavior differs.
% set fd [getfd \x41\x00\x00\x00\x7F\xFF\xFF\x7F\x41 -encoding utf-32le]
file1bdff730a78
% read $fd
A
% read $fd
error reading "file1bdff730a78": illegal byte sequence
Notice the first read returned all characters until the error offset and did not raise an error. It was only the second read that generated the error. This is completely in violation of read semantics.
This behavior extends to all scenarios where an error is to be reported. So the -strictencoding option, which triggers exceptions instead of mapping invalid bytes to their numeric values, also exhibits this.
% set fd [getfd A\x80 -encoding utf-8 -strictencoding 1]
file1bdff736878
% read $fd
A
% read $fd
error reading "file1bdff736878": illegal byte sequence
% close $fd
The case where read is supplied the number of characters to read is very similar.
% set fd [getfd A\xC0BC -encoding utf-8 -strictencoding 1]
file1bdff7400f8
% read $fd 2
A
% eof $fd
0
% read $fd 2
error reading "file1bdff7400f8": illegal byte sequence
% close $fd
A successful blocking read of N characters should never return fewer than N characters except under EOF conditions. As seen above, this is not the case.
Since non-strict behavior is the default for UTF encodings, the -nocomplainencoding option does not seem to have any effect for those cases. For other encodings however, which default to strict mode, the option results in invalid byte sequences being mapped to their numerically equal code points.
% set fd [getfd A\xC1\x80B -encoding ascii -nocomplainencoding 1]
file1443423cc08
% encread $fd
U+000041 U+0000C1 U+000080 U+000042
Non-blocking read
For the default (non-strict) case, non-blocking reads behave similarly to the blocking case, returning the requisite number of characters with invalid byte sequences mapped to their numerically equivalent code points.
The same is true when -strictencoding is specified.
% set fd [getfd A\xC1\x80B -encoding utf-8 -blocking 0 -strictencoding 1]
file14434240608
% read $fd
A
% read $fd
error reading "file14434240608": illegal byte sequence
% close $fd
However, unlike the blocking case, this behavior seems acceptable since a non-blocking read differs from blocking reads in that the semantics permit fewer characters to be returned than requested.
Blocking gets
A successful return from gets is supposed to return all characters up to the next line ending character (ignoring end of file cases). Again, for UTF encodings in default (non-strict) mode, behavior is as expected (as opposed to correct!) with each invocation returning a line with invalid characters mapped to their numeric code points.
% set fd [getfd \x61\x0a\x62\xc1B\x0a\x63\x0a -encoding utf-8 -translation lf]
file28ac6619f28
% codepoints [gets $fd]
U+000061
% codepoints [gets $fd]; # \xc1 mapped to U+00C1
U+000062 U+0000C1 U+000042
% codepoints [gets $fd]
U+000063
% close $fd
With the -strictencoding option turned on, the command generates an error.
% set fd [getfd \x61\x0a\x62\xc1B\x0a\x63\x0a -encoding utf-8 -translation lf -strictencoding 1]
file28ac66112a8
% codepoints [gets $fd]
U+000061
% codepoints [gets $fd]
error reading "file28ac66112a8": illegal byte sequence
TODO: Use of -strictencoding causes a hang in some cases.
% set fd [getfd A\xC1B\nC -encoding utf-8 -translation lf -strictencoding 1]
file23a16832bc8
% gets $fd
...hangs...
This seems like just a bug that needs fixing.
Non-blocking gets
To be written. I had some inconsistent behavior here depending on the location of the invalid bytes. Need further testing to determine whether it was pilot error or pinpoint the different cases.
Channel output has the following error cases to consider.
Case 1: The code point cannot be represented in the target encoding.
Case 2: The code point can be represented in the target encoding but is banned by the rules for the encoding.
Case 3: The code point lies outside the valid code point range.
# Helper to write to a file in an encoding and return contents in hex
proc encwrite {s args} {
set fd [open enctest.tmp w]
try {
if {[llength $args]} {
fconfigure $fd {*}$args
}
puts -nonewline $fd $s
} finally {
close $fd
}
set fd [open enctest.tmp]
try {
fconfigure $fd -encoding binary
set content [read $fd]
} finally {
close $fd
}
return [binary encode hex $content]
}
Case 1 (code point not representable) cannot occur for UTF-{8,16,32} by definition. For other encodings (some or all?) the default operation results in an error being raised.
(Note: encwrite returns the hex representation of output byte sequence)
% encwrite a\u00e9b -encoding ascii; # U+00E9 not supported in ascii
error writing "file204b5574468": illegal byte sequence
Further, -strictencoding makes no difference, as was the case for the encoding convertto command.
% encwrite a\u00e9b -strictencoding 1 -encoding ascii; # U+00E9 not supported in ascii
error writing "file204b5575ce8": illegal byte sequence
Case 2 (code point is representable but must not be) is specific to UTF encodings afaik. In particular, surrogates. An attempt to write surrogates to a UTF-8 encoded channel will fail by default.
% encwrite \uD800 -encoding utf-8
error writing "file28ac6cc9548": illegal byte sequence
Thus, in this case as well, -strictencoding 1 is superfluous as that is the default behavior in any case.
Note this default behavior differs from the channel input handling where by default surrogates are accepted.
However, the -nocomplainencoding option can be used to change this.
% encwrite \uD800 -encoding utf-8 -nocomplainencoding 1
eda080
Handling of Case 3 errors is different again. While for Case 1 an error was raised by default (and thus made -strictencoding superfluous), here the opposite tack is taken. No error is raised by default, and no error is raised in strict mode either. Instead the value is replaced by U+FFFD REPLACEMENT CHARACTER.
% encwrite [teststringobj newunicode 1 0x7fffffff] -encoding utf-32
fdff0000
% encwrite [teststringobj newunicode 1 0x7fffffff] -encoding utf-32 -strictencoding 1
fdff0000
(Here teststringobj newunicode is a hack to create arbitrary code point values via Tcl_NewUnicodeObj.)
There is a special encoding, binary, that can be configured for channels but is not applicable to the encoding command. It is intended to deal with binary strings (described earlier).
When writing to channels configured with the binary encoding, all code points in the argument passed must be in the range U+0000:U+00FF. The channel writes a byte stream where each byte contains the numeric value of the corresponding character in the string.
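For example, writing in-range code points behaves as expected (an illustrative transcript showing the expected result; the scratch file name is arbitrary):

% set fd [open tmp.bin wb]
% fconfigure $fd -encoding binary
% puts -nonewline $fd A\u00FF
% close $fd
% file size tmp.bin
2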
Passing code points outside this range will generate an error irrespective of the values of -strictencoding and -nocomplainencoding.
% set fd [open tmp.bin wb]
file28ac65fc1a8
% fconfigure $fd -encoding binary
% puts $fd \u0100
error writing "file28ac65fc1a8": illegal byte sequence
% fconfigure $fd -nocomplainencoding 1
% puts $fd \u0100
error writing "file28ac65fc1a8": illegal byte sequence
% close $fd
This differs from Tcl 8 behavior where, instead of generating an error, Tcl would simply ignore the higher bits of the numeric value.
There are a number of Tcl commands where encoding transforms are used implicitly when calling system APIs. These include glob, open, exec, file and the parsing of the command line or environment variables. The transform may require either encoding (passing paths to the system API) or decoding (receiving paths from the system API). In both cases, Tcl assumes the system encoding is in effect.
There is a problem with the above as discussed in the issues.
Issues specific to channels and system interfaces are described below. The issues related to encoding transforms described earlier also apply to transforms in I/O, with input and output being analogous to convertfrom and convertto respectively. Those issues are not repeated here.
A blocking read should never read fewer bytes than requested (or whole file) on a successful return. This is violated in the presence of invalid input bytes.
This has already been described earlier with an example.
See the mailing list thread for an ongoing discussion.
Related to the above is the question of the channel state when an exception is raised on a read or gets due to invalid input bytes.

My opinion is that when an exception is raised, the read position should not be changed from the value at the time the read or gets was invoked. This allows the application to then turn off the strict checking (if so desired) and read in the rest of the data. For example,
% set fd [open x.txt r]
file1de34c88158
% fconfigure $fd -encoding utf-8 -strictencoding 1
% gets $fd
a
% gets $fd
error reading "file1de34c88158": illegal byte sequence
% fconfigure $fd -encoding utf-8 -strictencoding 0
% gets $fd
bÀ
% gets $fd
c
An alternate suggestion made in the mailing list was to move the read position to the location of the invalid byte and return the characters successfully decoded so far in the error options dictionary. My personal opinion is returning successfully read data in the error options is unnatural and non-idiomatic.
For further explanation of the above recommended behavior and alternatives see the mailing list post and the containing thread.
Yet another alternative on encountering invalid input would be to raise an error and then prohibit any further read operations on the channel.
If an exception is raised on a channel because the channel encoding does not support the characters being written, it is not documented whether the characters up to the error location are written to the channel or not. Experimentally discovering this is difficult as it may depend on the buffering model for the channel as well as the underlying channel driver.

My preference would be that a write (puts) should successfully accept all the data or none.
Tcl’s assumption of the system encoding in system interfaces and file paths has some potential pitfalls. Unix systems permit any arbitrary sequence of bytes to be a file name. This means that irrespective of the encoding used, there may be invalid byte sequences within the name. The currently implemented Tcl 9 behavior of mapping invalid bytes to their numerically equivalent code points can result in silent misbehavior such as renaming of files.
A command like
foreach f [glob *] {rename $f $f.bak}
will not behave as expected if any matched file names are an arbitrary sequence of bytes as permitted by Unix file systems.
Similarly, a command like
set ::env(X) $::env(X)
will not be a no-op.
Fixing this requires adding support for a lossless encoding profile as listed in an earlier issue and modifying the relevant commands such as glob, open etc. to implicitly use that profile. The same would also apply to command line arguments, environment variables etc.
The -nocomplainencoding and -strictencoding options are in conflict but no error is raised if they are passed together to fconfigure. This is probably just a bug.
© 2022 Ashok P. Nadkarni