- Implementation of Coding System Priority lists in various locales

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
(2) Default locale

    (a) Some Unicode encodings (fixed width; maybe UTF-8, too?) may
        optionally be detected by the byte-order-mark magic: if the
        first two bytes are 0xFE 0xFF, the file is big-endian Unicode
        text; if 0xFF 0xFE, it is little-endian (byte-swapped)
        Unicode; if the mark is legal in UTF-8, it would be
        0xEF 0xBB 0xBF for either endianness.  This is probably an
        optimization that should not be on by default yet.

    (b) ISO-2022 encodings will be detected as long as they use
        explicit designation of all non-ASCII character sets.  This
        means that many 7-bit ISO-2022 encodings would be detected
        (eg, ISO-2022-JP), but EUC-JP and X Compound Text would not,
        because they implicitly designate character sets.

        N.B. Latin-1 will be detected as binary, as will any other
        Latin-* encoding.

        N.B. Explicit ISO-2022 designations are semantically
        equivalent to a Content-Type: header.  They are more dangerous
        because they are shorter, but I think we should recognize them
        by default despite the slight risk; XEmacs is a text editor.

        N.B. This is unlikely to be as dangerous as it looks at first
        glance.  Any file that includes an 8-bit-set byte before the
        first valid designation should be detected as binary.

    (c) Binary files will be detected (eg, by the presence of NULs,
        other non-whitespace control characters, absurdly long lines,
        or bytes >127).

    (d) Everything else is ASCII.

    (e) Newlines will be detected in text files.
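
The rules in (2) can be read as an ordered series of checks.  What
follows is only a sketch in Python, not Mule code: the function names,
the returned labels, the MAX_LINE cutoff and the designation pattern
are all assumptions made for illustration, and the ISO-2022 branch
inspects only the bytes that precede the first designation.

```python
import re

# ESC, then intermediate byte(s), then a final byte: an explicit designation.
DESIGNATION = re.compile(
    rb"\x1b(?:\$[@AB]|\$[\x28-\x2f][\x30-\x7e]|[\x28-\x2f][\x30-\x7e])"
)
TEXT_CONTROLS = frozenset(b"\t\n\r\f")  # whitespace controls allowed in text
MAX_LINE = 10_000                       # hypothetical "absurdly long" cutoff


def looks_binary(data):
    """Rule (c): NULs, other controls, bytes >127, absurdly long lines."""
    for byte in data:
        if byte >= 127:
            return True
        if byte < 32 and byte not in TEXT_CONTROLS:
            return True
    return any(len(line) > MAX_LINE for line in data.splitlines())


def classify_default_locale(data, use_bom=False):
    """Apply rules (a) through (d) in order and return a label."""
    if use_bom:  # rule (a), off by default
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be"
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le"
    match = DESIGNATION.search(data)
    # Rule (b): accept only if nothing suspicious precedes the designation.
    if match and not looks_binary(data[:match.start()]):
        return "iso-2022"
    if looks_binary(data):  # rule (c); Latin-1 text lands here
        return "binary"
    return "ascii"          # rule (d)
```

With these assumptions, ISO-2022-JP text is classified as "iso-2022",
while Latin-1 text such as b"caf\xe9" comes out as "binary", as the
first N.B. under (b) says it should.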

(3) European locales

    (a) Unicode may optionally be detected by the byte-order-mark
        magic.

    (b) ISO-2022 encodings will be detected as long as they use
        explicit designation of all non-ASCII character sets.

    (c) A locale-specific class of 1-byte character sets (eg,
        '(Latin-1)) will be detected.

        N.B.  A class is permitted for cases like Cyrillic, where
        both ISO-8859 encodings and incompatible encodings (KOI8-R)
        are in common use.  If you want to write a Latin-1 v. Latin-2
        detector, be my guest, but I don't think it would be easy or
        accurate.

    (d) Binary files will be detected per (2)(c), except that only
        8-bit bytes out of the encoding's range imply binary.

    (e) Everything else is ASCII.

    (f) Newlines will be detected in text files.
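
To make (3)(c) and (3)(d) concrete, here is a rough Python sketch of a
class-based check.  The HIGH_RANGE table, the function name and the
returned labels are hypothetical; the ranges reflect only that the
ISO-8859 parts place graphic characters in 0xA0-0xFF while KOI8-R
assigns all of 0x80-0xFF.

```python
TEXT_CONTROLS = frozenset(b"\t\n\r\f")  # whitespace controls allowed in text

# Hypothetical table: high bytes each 1-byte coding system uses for text.
HIGH_RANGE = {
    "iso-8859-1": range(0xA0, 0x100),
    "iso-8859-5": range(0xA0, 0x100),
    "koi8-r": range(0x80, 0x100),
}


def classify_european(data, locale_class):
    """Return (label, candidates) for a class of 1-byte coding systems."""
    # 7-bit part of rule (2)(c): NULs and other non-whitespace controls.
    if any(b == 127 or (b < 32 and b not in TEXT_CONTROLS) for b in data):
        return ("binary", [])
    high = {b for b in data if b > 127}
    if not high:
        return ("ascii", [])
    # Rule (3)(d): a high byte implies binary only if no member of the
    # class has it in range.
    fits = [cs for cs in locale_class
            if all(b in HIGH_RANGE[cs] for b in high)]
    return ("class", fits) if fits else ("binary", [])
```

Note that the sketch only narrows the class; it makes no attempt to
choose between members that both fit, which is the hard part.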

(4) CJK locales

    (a) Unicode may optionally be detected by the byte-order-mark
        magic.

    (b) ISO-2022 encodings will be detected as long as they use
        explicit designation of all non-ASCII character sets.

    (c) A locale-specific class of multi-byte and wide-character
        encodings will be detected.

        N.B. No 1-byte character sets (eg, Latin-1) will be detected.
        The reason for a class is to allow the Japanese to let Mule do
        the work of choosing EUC v. SJIS.

    (d) Binary files will be detected per (3)(d).

    (e) Everything else is ASCII.

    (f) Newlines will be detected in text files.
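
Newline detection, as in (2)(e), (3)(f) and (4)(f), could be as simple
as counting line terminators.  This Python sketch assumes an
ASCII-compatible byte stream (so not fixed-width Unicode); the
function name and the "mixed" label are inventions for illustration.

```python
def detect_eol(data):
    """Guess the end-of-line convention of an ASCII-compatible byte string."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF
    found = [name for name, count in
             (("crlf", crlf), ("cr", cr), ("lf", lf)) if count]
    if not found:
        return None                 # no line terminators at all
    if len(found) == 1:
        return found[0]
    return "mixed"                  # leave the decision to the caller
```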

(5) Unicode and general locales; multilingual use

    (a) Hopefully a system general enough to handle (2)--(4) will
        handle these, too, but we should watch out for gotchas like
        Unicode "plane 14" tags which (I think _both_ Ben and Olivier
        will agree) have no place in the internal representation, and
        thus must be treated as out-of-band control sequences.  I
        don't know if all such gotchas will be as easy to dispose of.

    (b) An explicit coding system priority list will be provided to
        allow multilingual users to autodetect both Shift JIS and Big
        5, say, but this ability is not promised by Mule, since it
        would involve (eg) heuristics like picking a set of code
        points that are frequent in Shift JIS and uncommon in Big 5
        and betting that a file containing many characters from that
        set is Shift JIS.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
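
As a rough illustration of what an explicit priority list buys in
(5)(b), the Python sketch below tries each coding system in order and
accepts the first that decodes the data without error.  It relies on
Python's own codecs rather than anything in Mule, and the function
name is hypothetical.  It does not implement the frequency heuristic
described above: when a byte string is valid in more than one coding
system, the earlier entry in the list simply wins.

```python
def detect_by_priority(data, priority):
    """Return the first coding system in `priority` that decodes `data`."""
    for name in priority:
        try:
            data.decode(name)       # strict decoding; raises on bad bytes
        except UnicodeDecodeError:
            continue
        return name
    return None                     # nothing fit; ask the user
```

For instance, EUC-JP text checked against ["ascii", "euc_jp"] is
rejected by the first entry and accepted by the second.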

  - Better Algorithm, More Flexibility, Different Levels of Certainty

  - Much More Flexible Coding System Priority List, per-Language
    Environment

  - User Ability to Select Encoding when System Is Unsure or
    Encounters Errors.

  - "No Corruption" Scheme for Preserving External Encoding when a
    Non-Invertible Transformation Is Applied.

    A preliminary and simple implementation is:

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
But you could implement it much more simply and usefully by just
determining, for any text being decoded into mule-internal, whether we
can go back and read the source again.  If not, remember the entire
file (GNUS message, etc) in text properties.  Then implement the UI
(like Netscape's) on top of that.  This way, you have something that
at least works, though it might be inefficient.  All we would then
need to do is work on making the underlying implementation more
efficient.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
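
The idea can be sketched in a few lines of Python.  The class and its
attribute names are hypothetical, and a plain attribute stands in for
text properties: the raw bytes are kept only when there is no way to
read the source again, and re-decoding always starts from the original
bytes rather than from the already-decoded text.

```python
class DecodedText:
    """Decoded text plus, when needed, the bytes it was decoded from."""

    def __init__(self, raw, coding, rereadable_source=None):
        self.coding = coding
        self.text = raw.decode(coding, errors="replace")
        # Callable returning the original bytes, or None if the source
        # (a pipe, a fetched message, ...) cannot be read a second time.
        self._source = rereadable_source
        # Remember the bytes ourselves only when we cannot go back.
        self._saved = None if rereadable_source else raw

    def original_bytes(self):
        if self._saved is not None:
            return self._saved
        return self._source()

    def redecode(self, coding):
        """Decode again with a coding system the user picked."""
        return DecodedText(self.original_bytes(), coding, self._source)
```

A wrong first guess is then harmless: calling redecode() with the
right coding system recovers the text, because nothing was thrown away.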