- A more detailed Mule design document for the Internals Manual follows:

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
XEmacs MULE Design Issues

Introduction

This document covers a number of design issues, problems and proposals
with regard to XEmacs MULE.  First we present some definitions and
some aspects of the design that have been agreed upon.  Then we
present some issues and problems that need to be addressed, and then I
include a proposal of mine to address some of these issues.  When
there are other proposals, for example from Olivier, these will be
appended to the end of this document.

Definitions and Design Basics
=============================

First, text is defined to be a series of characters which together
define an utterance or partial utterance in some language.
Generally, this language is a human language, but it may also be a
computer language if the computer language uses a representation close
enough to that of human languages for it to also make sense to call its
representation text.  Text is opposed to binary, which is a sequence
of bytes, representing machine-readable but not human-readable data.
A byte is merely a number within a predefined range, which nowadays is
nearly always zero to 255.  A character is a unit of text.  What makes
one character different from another is not always clear-cut.  It is
generally related to the appearance of the character, although perhaps
not any possible appearance of that character, but some sort of ideal
appearance that is assigned to a character.  Whether two characters
that look very similar are actually the same depends on various
factors, including political ones, and on whether the characters are
used to mean similar sorts of things, or behave similarly in similar
contexts.  In any case, it is not always clearly defined whether two
characters are actually the same or not.  In practice, however, this
is more or less agreed upon.

A character set is just that, a set of one or more characters.  The
set is unique in that there will not be more than one instance of the
same character in a character set, and logically is unordered,
although an order is often imposed or suggested for the characters in
the character set.  We can also define an order on a character set,
which is a way of assigning a unique number, or possibly a pair of
numbers, or a triplet of numbers, or even a set of four or more
numbers to each character in the character set.  The combination of a
character set and an order on it results in an ordered character set.  In an
ordered character set, there is an upper limit and a lower limit on
the possible values that a character, or that any number within the
set of numbers assigned to a character, can take.  However, the lower
limit does not have to start at zero or one, or anywhere else in
particular, nor does the upper limit have to end anywhere particular,
and there may be gaps within these ranges such that particular numbers
or sets of numbers do not have a corresponding character, even though
they are within the upper and lower limits.  For example, ASCII
defines a very standard ordered character set.  It is normally defined
to be 94 characters in the range 33 through 126 inclusive on both
ends, with every possible character within this range being actually
present in the character set.

Sometimes the ASCII character set is extended to include what are
called non-printing characters.  Non-printing characters are
characters which, instead of being displayed in a more or less
rectangular block like all other characters, indicate certain
functions.  Typically they control the display upon which the
characters are being displayed, have some effect on a communications
channel that may be currently open and transmitting characters,
change the meaning of future characters as they are being decoded, or
perform some other similar function.  You might say that
non-printing characters are somewhat of a hack because they are a
special exception to the standard concept of a character as being a
printed glyph that has some direct correspondence in the non-computer
world.

With non-printing characters in mind, the 94-character ordered
character set called ASCII is often extended into a 96-character
ordered character set, also often called ASCII, which includes in
addition to the 94 characters already mentioned, two non-printing
characters, one called space and assigned the number 32, just below
the bottom of the previous range, and another called delete or rubout,
which is given number 127 just above the end of the previous range.
Thus to reiterate, the result is a 96-character ordered character set,
whose characters take the values from 32 to 127 inclusive.  Sometimes
ASCII is further extended to contain 32 more non-printing characters,
which are given the numbers zero through 31 so that the result is a
128-character ordered character set with characters numbered zero
through 127, and with many non-printing characters.  Another way to
look at this, and the way that is normally taken by XEmacs MULE, is
that the characters that would be in the range zero through 31 in the
most extended definition of ASCII instead form their own ordered
character set, which is called control zero, and consists of 32
characters in the range zero through 31.  A similar ordered character
set called control one is also created, and it contains 32 more
non-printing characters in the range 128 through 159.  Note that none
of these three ordered character sets overlaps with another in the
numbers assigned to its characters, so they can all be used at
once.  Note further that the same character can occur in more than one
character set.  This was shown above, for example, in two different
ordered character sets we defined, one of which we could have called
ASCII, and the other ASCII-extended, to show that it had been extended
by two non-printing characters.  Most of the characters in these two
character sets are shared and present in both of them.
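
The three non-overlapping ordered character sets just described can be
sketched as follows.  This is only an illustration in Python, using
the 96-character variant of ASCII; the names CHARSET_RANGES,
charset_of and ranges_overlap are hypothetical and are not XEmacs
identifiers.

```python
# Hypothetical illustration; none of these names are XEmacs identifiers.
# Each ordered character set is given as an inclusive (low, high) range.
CHARSET_RANGES = {
    "control-0": (0, 31),     # 32 non-printing characters
    "ascii": (32, 127),       # space, the 94 printing characters, delete
    "control-1": (128, 159),  # 32 more non-printing characters
}


def charset_of(number):
    """Return the name of the charset whose range contains number, or None."""
    for name, (low, high) in CHARSET_RANGES.items():
        if low <= number <= high:
            return name
    return None


def ranges_overlap(a, b):
    """Return True if two inclusive ranges share at least one number."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Because no two ranges overlap, a bare number is enough to tell which
of the three sets a character comes from.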

Note that there is no restriction on the size of the character set, or
on the numbers that are assigned to characters in an ordered character
set.  It is often extremely useful to represent a sequence of
characters as a sequence of bytes, where a byte as defined above is a
number in the range zero to 255.  An encoding does precisely this: it
is simply a mapping from a sequence of characters, possibly augmented
with information indicating the character set that each of these
characters belongs to, to a sequence of bytes which represents that
sequence of characters and no other, which is to say the mapping is
reversible.
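
To make the reversibility requirement concrete, here is a toy encoding
written in Python.  It is invented purely for illustration and is not
any real coding system: each character is given as a (charset, number)
pair and is written as two bytes, a made-up charset identifier
followed by the number.  The names CHARSET_IDS, encode and decode are
assumptions.

```python
# Toy encoding, invented for illustration only; not a real coding system.
CHARSET_IDS = {"control-0": 0, "ascii": 1, "control-1": 2}
ID_TO_CHARSET = {ident: name for name, ident in CHARSET_IDS.items()}


def encode(chars):
    """Map a sequence of (charset, number) pairs to bytes, two per character."""
    out = bytearray()
    for charset, number in chars:
        out.append(CHARSET_IDS[charset])  # which character set
        out.append(number)                # position within that set's order
    return bytes(out)


def decode(data):
    """Reverse of encode: recover the (charset, number) pairs."""
    if len(data) % 2:
        raise ValueError("truncated input")
    return [(ID_TO_CHARSET[data[i]], data[i + 1])
            for i in range(0, len(data), 2)]
```

Decoding the output of encode gives back exactly the original
sequence, which is what it means for the mapping to be reversible.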

A coding system is a set of rules for encoding a sequence of
characters augmented with character set information into a sequence of
bytes, and later performing the reverse operation.  It is frequently
possible to group coding systems into classes or types based on common
features.  Typically, for example, a particular coding system class
may contain a base coding system which specifies some of the rules,
but leaves the rest unspecified.  Individual members of the coding
system class are formed by starting with the base coding system, and
augmenting it with additional rules to produce a particular coding
system, what you might think of as a sort of variation on a
theme.

XEmacs Specific Definitions
===========================

First of all, in XEmacs, the concept of character is a little
different from the general definition given above.  For one thing, the
character set that a character belongs to may or may not be an
inherent part of the character itself.  In other words, the same
character occurring in two different character sets may appear in
XEmacs as two different characters.  This is generally the case now,
but we are attempting to move in the other direction.  Different
proposals may have different ideas about exactly the extent to which
this change will be carried out.  The general trend, though, is to
represent all information about a character other than the character
itself, using text properties attached to the character.  That way two
instances of the same character will look the same to lisp code that
merely retrieves the character, and does not also look at the text
properties of that character.  Everyone involved is in agreement in
doing it this way with all Latin characters, and in fact for all
characters other than Chinese, Japanese, and Korean ideographs.  For
those, there may be a difference of opinion.

A second difference between the general definition of character and
the XEmacs usage of character is that each character is assigned a
unique number that distinguishes it from all other characters in the
world, or at the very least, from all other characters currently
existing anywhere inside the current XEmacs invocation.  (If there is
a case where the weaker statement applies, but not the stronger
statement, it would possibly be with composite characters and any
other such characters that are created on the sly.)

This unique number is called the character representation of the
character, and its particular details are a matter of debate.  There
is a current standard in use, but it is undoubtedly going to
change.  What has definitely been agreed upon is that it will be an
integer, more specifically a non-negative integer, represented with
at most 31 bits on a 32-bit architecture, and possibly up to
63 bits on a 64-bit architecture, with the proviso that any characters
whose representation would fit on a 64-bit architecture, but not
on a 32-bit architecture, would be used only for composite characters
and others that would satisfy the weak uniqueness property mentioned
above, but not the strong uniqueness property.

At this point, it is useful to talk about the different
representations that a sequence of characters can take.  The simplest
representation is simply as a sequence of characters, and this is
called the lisp representation of text, because it is the
representation that lisp programs see.  Other representations include
the external representation, which refers to any encoding of the
sequence of characters, using the definition of encoding mentioned
above.  Typically, text in the external representation is used outside
of XEmacs, for example in files, e-mail messages, web sites, and the
like.  Another representation for a sequence of characters is what I
will call the byte representation, and it represents the way that
XEmacs internally represents text in a buffer, or in a string.
Potentially, the representation could be different between a buffer
and a string, and then the terms buffer byte representation and string
byte representation would be used, but in practice I don't think this
will occur.  It will be possible, of course, for buffers and strings,
or particular buffers and particular strings, to contain different
sub-representations of a single representation.  For example,
Olivier's 1-2-4 proposal allows for three sub-representations of his
internal byte representation, allowing for characters that are 1, 2,
and 4 bytes wide respectively.  A particular string may be in one
sub-representation, and a particular buffer in another
sub-representation, but overall both are following the same byte
representation.  I do not use the term internal representation here,
as many people have, because it is potentially ambiguous.  Another
representation is called the array of characters representation.  This
is a representation on the C-level in which the sequence of text is
represented, not using the byte representation, but by using an array
of characters, each represented using the character representation.
This sort of representation is often used by redisplay because it is
more convenient to work with than any of the other internal
representations.

The term binary representation may also be heard.  Binary
representation is used to represent binary data.  When binary
data is represented in the lisp representation, an equivalence is
simply set up between bytes zero through 255, and characters zero
through 255.  These characters come from four character sets, which
are, from bottom to top, control zero, ASCII, control one, and Latin 1.
Together, they comprise 256 characters, and are a good mapping for the
256 possible bytes in a binary representation.  Binary representation
could also be used to refer to an external representation of the
binary data, which is a simple direct byte-to-byte representation.  No
internal representation should ever be referred to as a binary
representation because of ambiguity.

The terms character set and coding system were defined generally,
above.  In XEmacs, the equivalent
concepts exist, although character set has been shortened to charset,
and in fact represents specifically an ordered character set.  For
each possible charset, and for each possible coding system, there is
an associated object in XEmacs.  These objects will be of type charset
and coding system, respectively.  Charsets and coding systems are
divided into classes, or types, the normal term under XEmacs, and all
possible charsets and coding systems that may be defined must be in one
of these types.  If you need to create a charset or coding system that
is not one of these types, you will have to modify the C code to
support this new type.  Some of the existing or soon-to-be-created
types are, or will be, generic enough so that this shouldn't be an
issue.  Note also that the byte representation of text and the
character representation of a character are closely related.  You
might say that ideally each is the simplest equivalent of the other
given the general constraints on each representation.
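
The equivalence between bytes and characters zero through 255 that is
used for binary data can be sketched like this.  The ranges are the
ones given in this document, with Latin 1 taken to occupy the
remaining 96 values, 160 through 255; the Python names
BINARY_CHARSETS, binary_to_chars, chars_to_binary and
charset_for_byte are hypothetical.

```python
# Ranges follow the text; the names and helpers are hypothetical.
BINARY_CHARSETS = [
    ("control-0", 0, 31),
    ("ascii", 32, 127),
    ("control-1", 128, 159),
    ("latin-1", 160, 255),
]


def binary_to_chars(data):
    """Lisp-level view of binary data: byte n becomes character number n."""
    return list(data)


def chars_to_binary(chars):
    """Inverse mapping; bytes() raises ValueError outside zero through 255."""
    return bytes(chars)


def charset_for_byte(value):
    """Name the charset that supplies the character standing for this byte."""
    for name, low, high in BINARY_CHARSETS:
        if low <= value <= high:
            return name
    raise ValueError("not a byte value: %r" % (value,))
```

The four ranges cover all 256 byte values exactly once, so every byte
has exactly one character to stand for it and the round trip loses
nothing.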

To be specific, in the current MULE representation,

1. Characters encode both the character itself and the character set
   that it comes from.  These character sets are always assumed to be
   representable as an ordered character set of size 96 or of size 96
   by 96, or the trivially-related sizes 94 and 94 by 94.  The only
   allowable exceptions are the control zero and control one character
   sets, which are of size 32.  Character sets which do not naturally
   have a compatible ordering such as this are shoehorned into an
   ordered character set, or possibly two ordered character sets of a
   compatible size.

2. The variable width byte representation was deliberately chosen to
   allow scanning text forwards and backwards efficiently.  This
   necessitated dividing the possible bytes into three ranges which
   we shall call A, B, and C.  Range A is used exclusively for
   single-byte characters, which is to say characters that are
   represented using only one byte.  Multi-byte
   characters are always represented by using one byte from Range B,
   followed by one or more bytes from Range C.  What this means is
   that bytes that begin a character are unequivocally distinguished
   from bytes that do not begin a character, and therefore there is
   never a problem scanning backwards and finding the beginning of a
   character.  Note that UTF-8 adopts an approach that is very similar
   in spirit in that it uses separate ranges for the first byte of a
   multi-byte sequence, and the following bytes in a multi-byte
   sequence.

3. Given the fact that all ordered character sets allowed were
   essentially 96 characters per dimension, it made perfect sense to
   make Range C comprise 96 bytes.  With a little more tweaking, the
   currently standard MULE byte representation was derived from
   this.

4. The MULE byte representation defined four basic representations for
   characters, which would take up from one to four bytes,
   respectively.  The MULE character representation thus had the
   following constraints:

      A. Character numbers zero through 255 should represent the
         characters that binary values zero through 255 would be
         mapped onto.  (Note: this was not the case in Kenichi Handa's
         version of this representation, but I changed it.)

      B. The four sub-classes of representation in the MULE byte
         representation should correspond to four contiguous
         non-overlapping ranges of characters.

      C. The algorithmic conversion between a single character's
         byte representation and its character representation
         should be as easy as possible.

      D. Given the previous constraints, the character representation
         should be as compact as possible, which is to say it should
         use the fewest bits possible.
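
The backward-scanning property described in item 2 above can be
illustrated with a short Python sketch.  The concrete byte values
chosen for Ranges A, B and C below are an assumption, picked only so
that Range C has 96 values as item 3 says; this document does not
state the actual boundaries.  The function names are hypothetical as
well.

```python
# Assumed boundaries for illustration; the text does not specify them.
RANGE_A = range(0x00, 0x80)   # single-byte characters
RANGE_B = range(0x80, 0xA0)   # first byte of a multi-byte character
RANGE_C = range(0xA0, 0x100)  # following bytes of a multi-byte character


def starts_character(byte):
    """A byte begins a character exactly when it lies in Range A or B."""
    return byte in RANGE_A or byte in RANGE_B


def previous_char_start(data, pos):
    """Return the index where the character ending just before pos begins."""
    if pos <= 0:
        raise ValueError("no character before position 0")
    i = pos - 1
    while i > 0 and not starts_character(data[i]):
        i -= 1  # skip Range C bytes until a character-starting byte is found
    return i


def split_characters(data):
    """Split a byte string into one byte string per character."""
    chars = []
    start = 0
    for i in range(1, len(data)):
        if starts_character(data[i]):
            chars.append(data[start:i])
            start = i
    if data:
        chars.append(data[start:])
    return chars
```

Scanning backwards never needs to look at more than the bytes of one
character, because a byte from Range C can never be mistaken for the
start of a character.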

So you see that the entire structure of the byte and character
representations stemmed from a very small number of basic choices,
which were
 a) the choice to encode character set information in a character
 b) the choice to assume that all character sets would have an order
    imposed upon them with 96 characters per dimension, in one or
    two dimensions. (This is less arbitrary than it seems--it follows
    ISO-2022.)
 c) the choice to use a variable width byte representation.

What this means is that you cannot really separate the byte
representation, the character representation, and the assumptions made
about characters and whether they encode character set information
from each other.  All of these are closely intertwined, and for
purposes of simplicity, they should be designed together.  If you
change one representation without changing another, you are in essence
creating a completely new design with its own attendant problems.
Since your new design is likely to be quite complex and not very
coherent with regard to the translation between the character and byte
representations, you are likely to run into problems.