- Implementation of Coding System Priority lists in various locales
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
(2) Default locale
(a) Some Unicode (fixed width; maybe UTF-8, too?) may optionally
be detected by the byte-order-mark magic (if the first two
bytes are 0xFE 0xFF, the file is Unicode text; if 0xFF 0xFE,
it is wrong-endian Unicode; in UTF-8 the mark is the three
bytes 0xEF 0xBB 0xBF, whatever the endianness). This is
probably an optimization that should not be on by default yet.
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets. This
means that many 7-bit ISO-2022 encodings would be detected
(eg, ISO-2022-JP), but EUC-JP and X Compound Text would not,
because they implicitly designate character sets.
N.B. Latin-1 will be detected as binary, as will any other
Latin-* encoding.
N.B. An explicit ISO-2022 designation is semantically
equivalent to a Content-Type: header. It is more dangerous
because it is shorter, but I think we should recognize such
designations by default despite the slight risk; XEmacs is a
text editor.
N.B. This is unlikely to be as dangerous as it looks at first
glance. Any file that includes an 8-bit-set byte before the
first valid designation should be detected as binary.
(c) Binary files will be detected (eg, presence of NULs, other
non-whitespace control characters, absurdly long lines, and
presence of bytes >127).
(d) Everything else is ASCII.
(e) Newlines will be detected in text files.
(3) European locales
(a) Unicode may optionally be detected by the byte-order-mark
magic.
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets.
(c) A locale-specific class of 1-byte character sets (eg,
'(Latin-1)) will be detected.
N.B. The reason for permitting a class is for cases like
Cyrillic where there are both ISO-8859 encodings and
incompatible encodings (KOI8-R) in common use. If you want to
write a Latin-1 v. Latin-2 detector, be my guest, but I don't
think it would be easy or accurate.
(d) Binary files will be detected per (2)(c), except that only
8-bit bytes out of the encoding's range imply binary.
(e) Everything else is ASCII.
(f) Newlines will be detected in text files.
(4) CJK locales
(a) Unicode may optionally be detected by the byte-order-mark
magic.
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets.
(c) A locale-specific class of multi-byte and wide-character
encodings will be detected.
N.B. No 1-byte character sets (eg, Latin-1) will be detected.
The reason for a class is to allow the Japanese to let Mule do
the work of choosing EUC v. SJIS.
(d) Binary files will be detected per (3)(d).
(e) Everything else is ASCII.
(f) Newlines will be detected in text files.
(5) Unicode and general locales; multilingual use
(a) Hopefully a system general enough to handle (2)--(4) will
handle these, too, but we should watch out for gotchas like
Unicode "plane 14" tags which (I think _both_ Ben and Olivier
will agree) have no place in the internal representation, and
thus must be treated as out-of-band control sequences. I
don't know if all such gotchas will be as easy to dispose of.
(b) An explicit coding system priority list will be provided to
allow multilingual users to autodetect both Shift JIS and Big
5, say, but this ability is not promised by Mule, since it
would involve (eg) heuristics like picking a set of code
points that are frequent in Shift JIS and uncommon in Big 5
and betting that a file containing many characters from that
set is Shift JIS.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
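The default-locale rules in (2) can be sketched roughly as follows.
This is a minimal illustration in Python, not XEmacs or Mule code:
every name in it (detect_default_locale, looks_binary, detect_eol,
max_line) is hypothetical, the 4096-byte threshold for an "absurdly
long" line is an arbitrary assumption, and the designation pattern is
only a rough approximation of ISO-2022 escape sequences.

```python
import re

# Byte-order marks checked only when the optional BOM test is enabled (2)(a).
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Rough approximation of an explicit ISO-2022 designation: ESC, one or more
# intermediate bytes ('$' or 0x28-0x2F), then a final byte (0x30-0x7E).
DESIGNATION = re.compile(rb"\x1b[\x24\x28-\x2f]+[\x30-\x7e]")

# Control characters that are normal in text files.
TEXT_CONTROLS = frozenset(b"\t\n\r\f")


def looks_binary(data: bytes, max_line: int = 4096) -> bool:
    """Rule (2)(c): NULs, other control characters, bytes > 127, long lines."""
    for b in data:
        if b > 126 or (b < 32 and b not in TEXT_CONTROLS):
            return True
    return any(len(line) > max_line for line in data.splitlines())


def detect_default_locale(data: bytes, use_bom: bool = False,
                          max_line: int = 4096) -> str:
    """Classify data as a BOM encoding, 'iso-2022', 'binary', or 'ascii'."""
    if use_bom:  # off by default, as (2)(a) suggests
        for bom, name in BOMS:
            if data.startswith(bom):
                return name
    # (2)(b): accept a designation only if no 8-bit byte precedes it.
    m = DESIGNATION.search(data)
    if m and all(b < 0x80 for b in data[:m.start()]):
        return "iso-2022"
    if looks_binary(data, max_line):
        return "binary"
    return "ascii"  # (2)(d): everything else


def detect_eol(data: bytes) -> str:
    """Rule (2)(e): a very simple newline-convention check for text files."""
    if b"\r\n" in data:
        return "crlf"
    if b"\r" in data:
        return "cr"
    return "lf"
```

Under these assumptions, b"caf\xe9\n" (Latin-1) is classified as
binary, while a buffer that begins with ESC $ B is classified as
ISO-2022. The ISO-2022 test has to run before the binary test,
because ESC is itself a non-whitespace control character.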
- Better Algorithm, More Flexibility, Different Levels of Certainty
- Much More Flexible Coding System Priority List, per-Language
Environment
- User Ability to Select Encoding when System Is Unsure or Encounters
Errors.
- "No Corruption" Scheme for Preserving External Encoding when a
Non-Invertible Transformation Is Applied.
A preliminary and simple implementation is:
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
But you could implement it much more simply and usefully by just
determining, for any text being decoded into mule-internal, whether
we can go back and read the source again. If not, remember the
entire file (Gnus message, etc.) in text properties. Then, implement
the UI (like Netscape's) on top of that. This way, you have
something that at least works, though it might be inefficient. All
we would need to do after that is make the underlying implementation
more efficient.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
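The "remember the source" idea above might look something like the
following sketch. It is written in Python purely for illustration.
The DecodedText record stands in for text properties, and all of the
names (DecodedText, decode_keeping_source, redecode, reread) are
hypothetical rather than part of any XEmacs interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DecodedText:
    text: str                                     # decoded (internal) form
    coding: str                                   # coding system used
    reread: Optional[Callable[[], bytes]] = None  # how to fetch the source again
    raw: Optional[bytes] = None                   # saved bytes if no reread


def decode_keeping_source(data: bytes, coding: str,
                          reread: Optional[Callable[[], bytes]] = None
                          ) -> DecodedText:
    """Decode data; keep the raw bytes only when the source can't be re-read."""
    text = data.decode(coding, errors="replace")
    saved = None if reread is not None else data
    return DecodedText(text, coding, reread, saved)


def redecode(decoded: DecodedText, new_coding: str) -> DecodedText:
    """Decode again under a coding system chosen by the user."""
    data = decoded.reread() if decoded.reread is not None else decoded.raw
    return decode_keeping_source(data, new_coding, decoded.reread)


# Usage: a Shift JIS message wrongly decoded as Latin-1, then corrected.
source = "\u65e5\u672c\u8a9e".encode("shift_jis")
wrong = decode_keeping_source(source, "latin-1")
fixed = redecode(wrong, "shift_jis")
```

Storing the whole source is the inefficient part the note concedes.
The sketch only shows that a wrong guess need not lose information.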