Internals Manual - [There are many people who work on the XEmacs C
code, and new ones come and go periodically.  For this reason
we need an internals manual that documents how the internal structure
of XEmacs, particularly on the C side, works.  Such an introduction
was missing for Mule, although there were already sections describing
particular aspects of the internationalization features of XEmacs.  I
wrote this document specifically as a broad introduction to the
internationalization (a.k.a. Mule) issues that any coder working on
XEmacs is likely to encounter.]

Introduction to Mule Issues

In XEmacs, Mule is a code word for the support for the input, handling,
and display of multi-lingual text.  This section provides an overview of
how this support impacts the C and Lisp code in XEmacs.  It is
important for anyone who works on the C or the Lisp code, especially
on the C code, to be aware of these issues, even if they don't work
directly on code that implements multi-lingual features, because there
are various general procedures that need to be followed in order to
write Mule-compliant code.  (The specifics of these procedures are
documented elsewhere in this manual.)

There are four primary aspects of Mule support:

1)
Internal handling and representation of multi-lingual text.

2)
Conversion between the internal representation of text and the various
external representations in which multi-lingual text is encoded, such
as Unicode representations (including mostly fixed-width encodings
such as UCS-2/UTF-16 and UCS-4, and variable-width ASCII-compatible
encodings such as UTF-7 and UTF-8); the various ISO2022
representations, which typically use escape sequences to switch
between different character sets (such as Compound Text, used under X
Windows, and JIS and EUC, used specifically for encoding Japanese);
Microsoft's multi-byte encodings (such as Shift-JIS); various simple
encodings for particular 8-bit character sets (such as Latin-1 and
Latin-2, and encodings (such as koi8 and Alternativny) for Cyrillic);
and others.  This conversion needs to happen both for text in files
and text sent to or retrieved from system API calls.  It even needs to
happen for external binary data because the internal representation
does not represent binary data simply as a sequence of bytes as it is
represented externally.

3)
Proper display of multi-lingual characters.

4)
Input of multi-lingual text using the keyboard.

These four aspects are for the most part independent of each other.

INTERNAL REPRESENTATION OF TEXT
===============================

In a single-byte world (ASCII and its 8-bit extensions), life is very
simple.  There are at most 256 characters, and each character is
represented using one of the numbers 0 through 255, which fits into a
single byte.  In the multi-lingual world, however, it is
much more complicated.  There are a great number of different
characters which are organized in a complex fashion into various
character sets.  The representation to use is not obvious because
there are issues of size versus speed to consider.  In fact, there are
in general two kinds of representations to work with: one that
represents a single character using an integer (possibly a byte), and
the other representing a single character as a sequence of bytes.  The
former representation is normally called fixed width, and the other
variable width. Both representations represent exactly the same
characters, and the conversion from one representation to the other is
governed by a specific formula (rather than by table lookup) but it
may not be simple.  Most C code need not, and in fact should not, know
the specifics of exactly how the representations work.  In fact, the
code must not make assumptions about the representations.  This means
in particular that it must use the proper macros for retrieving the
character at a particular memory location, determining how many
characters are present in a particular stretch of text, and
incrementing a pointer to a particular character to point to the
following character, and so on.  It must not assume that one character
is stored using one byte, or even using any particular number of
bytes.  It must not assume that the number of characters in a stretch
of text bears any particular relation to a number of bytes in that
stretch.  It must not assume that the character at a particular memory
location can be retrieved simply by dereferencing the memory location,
even if a character is known to be ASCII or is being compared with an
ASCII character, etc.  Careful coding is required to be Mule clean.
The biggest work of adding Mule support, in fact, is converting all of
the existing code to be Mule clean.
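
The rules above can be illustrated with a small standalone sketch.  It
is not XEmacs code: the helper names (char_byte_length, next_char,
count_chars, char_at) are hypothetical, and UTF-8 is used purely as a
convenient example of a variable-width format, not as a description of
the actual internal representation.  The sketch also assumes the text
is well-formed.  The point is only that character counts, pointer
increments and character retrieval all go through helpers rather than
through byte arithmetic or plain dereferencing.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only.  UTF-8 stands in for
   "some variable-width format"; this is an assumption of the sketch,
   not a description of the XEmacs internal format.  Input is assumed
   to be well-formed. */

/* Number of bytes occupied by the character whose first byte is at P. */
static size_t
char_byte_length (const unsigned char *p)
{
  if (*p < 0x80)           return 1;
  if ((*p & 0xE0) == 0xC0) return 2;
  if ((*p & 0xF0) == 0xE0) return 3;
  return 4;
}

/* Pointer to the character following the one at P. */
static const unsigned char *
next_char (const unsigned char *p)
{
  return p + char_byte_length (p);
}

/* Number of characters in the NBYTES bytes starting at TEXT. */
static size_t
count_chars (const unsigned char *text, size_t nbytes)
{
  const unsigned char *end = text + nbytes;
  size_t n = 0;

  while (text < end)
    {
      text = next_char (text);
      n++;
    }
  return n;
}

/* Fixed-width (integer) value of the character at P, computed by
   formula from its variable-width form. */
static unsigned long
char_at (const unsigned char *p)
{
  switch (char_byte_length (p))
    {
    case 1:
      return p[0];
    case 2:
      return ((unsigned long) (p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    case 3:
      return ((unsigned long) (p[0] & 0x0F) << 12)
        | ((unsigned long) (p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    default:
      return ((unsigned long) (p[0] & 0x07) << 18)
        | ((unsigned long) (p[1] & 0x3F) << 12)
        | ((unsigned long) (p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }
}
```

With the six bytes 'a', 0xC3, 0xA9, 0xE6, 0x97, 0xA5 as input,
count_chars reports three characters, and simply dereferencing the
second byte would yield 0xC3 rather than the character value 0xE9
that char_at returns.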

Lisp code is mostly unaffected by these concerns.  Text in strings and
buffers appears simply as a sequence of characters regardless of
whether Mule support is present.  The biggest difference from older
versions of Emacs, and from current versions of FSF Emacs, is that
integers and characters are no longer equivalent, but are separate
Lisp Object types.

CONVERSION BETWEEN INTERNAL AND EXTERNAL REPRESENTATIONS
========================================================

All text needs to be converted to an external representation before
being sent to a function or file, and all text retrieved from a
function or file needs to be converted to the internal representation.
This conversion needs to happen as close to the source or destination
of the text as possible.  No operations should ever be performed on
text encoded in an external representation other than simple copying,
because no assumptions can reliably be made about the format of this
text.  You cannot assume, for example, that the end of text is
terminated by a null byte. (For example, if the text is Unicode, it
will have many null bytes in it.)  You cannot find the next "slash"
character by searching through the bytes until you find a byte that
looks like a "slash" character, because it might actually be the
second byte of a Kanji character.  Furthermore, all text in the
internal representation must be converted, even if it is known to be
completely ASCII, because the external representation may not be ASCII
compatible (for example, if it is Unicode).
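
A minimal sketch of the two pitfalls just described, using hand-written
byte sequences.  The function names naive_length and naive_find_byte
are hypothetical and exist only to show what goes wrong.  The first
sample is the text "AB" in little-endian UTF-16; the second is a
Shift-JIS Kanji whose second byte has the value 0x5C, the same value as
the ASCII backslash (the sketch uses the backslash because that byte
value is known to occur as a Shift-JIS trail byte).

```c
#include <stddef.h>
#include <string.h>

/* "AB" in UTF-16LE, followed by a 16-bit terminator. */
static const unsigned char utf16le_ab[] =
  { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };

/* One Shift-JIS Kanji (bytes 0x95 0x5C) followed by ASCII 'x'.
   The Kanji's second byte equals the ASCII backslash, 0x5C. */
static const unsigned char sjis_text[] = { 0x95, 0x5C, 'x', 0x00 };

/* WRONG for external text: assumes a null byte ends the text. */
static size_t
naive_length (const unsigned char *ext)
{
  return strlen ((const char *) ext);
}

/* WRONG for external text: assumes every byte equal to SEP is a
   separator character.  Returns the byte offset of the first match,
   or -1 if there is none. */
static long
naive_find_byte (const unsigned char *ext, size_t nbytes, unsigned char sep)
{
  const unsigned char *hit = memchr (ext, sep, nbytes);
  return hit ? (long) (hit - ext) : -1;
}
```

Here naive_length (utf16le_ab) stops after the first byte, and
naive_find_byte (sjis_text, 3, 0x5C) reports a "separator" in the
middle of the Kanji.  Both mistakes disappear if the text is converted
to the internal representation before it is examined.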

The place where C code needs to be the most careful is when calling
external API functions.  It is easy to forget that all text passed to
or retrieved from these functions needs to be converted.  This
includes text in structures passed to or retrieved from these
functions and all text that is passed to a callback function that is
called by the system.

Macros are provided to perform conversions to or from external text.
These macros are called TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT
respectively.  These macros accept input in various forms, such as
Lisp strings, buffers, lstreams, and raw data, and can return
data in multiple formats, including both malloc()ed and alloca()ed
data.  The use of alloca()ed data here is particularly important
because, in general, the returned data will not be used after making
the API call, and as a result, using alloca()ed data provides a very
cheap and easy to use method of allocation.

These macros take a coding system argument which indicates the nature
of the external encoding.  A coding system is an object that
encapsulates the structures of a particular external encoding and the
methods required to convert to and from this encoding.  A facility
exists to create coding system aliases, which in essence gives a
single coding system two different names.  It is effectively used in
XEmacs to provide a layer of abstraction on top of the actual coding
systems.  For example, the coding system alias "file-name" points to
whichever coding system is currently used for encoding and decoding
file names as passed to or retrieved from system calls.  In general,
the actual encoding will differ from system to system, and will also
depend on the particular locale that the user is in.  The file-name
alias hides that implementation detail behind an abstract interface
layer, which provides a unified set of coding systems that are
consistent across all operating environments.
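
The alias idea can be sketched as a small lookup table.  This is an
illustration under stated assumptions, not the XEmacs implementation:
the struct, the table contents, the function names
(resolve_coding_system, set_alias_target) and the initial mappings are
all hypothetical.  It shows only that callers name the abstract alias
while the mapping to an actual coding system can be changed in one
place.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical alias table; the initial mappings are invented for
   this sketch and do not reflect real XEmacs defaults. */
struct coding_alias
{
  const char *alias;
  const char *target;
};

static struct coding_alias alias_table[] = {
  { "file-name", "native" },
  { "terminal",  "native" }
};

#define ALIAS_COUNT (sizeof alias_table / sizeof alias_table[0])

/* Target of NAME if it is an alias, otherwise NULL. */
static const char *
lookup_alias (const char *name)
{
  size_t i;

  for (i = 0; i < ALIAS_COUNT; i++)
    if (strcmp (alias_table[i].alias, name) == 0)
      return alias_table[i].target;
  return NULL;
}

/* Follow aliases until a name with no table entry is reached.
   Returns NULL if the chain is suspiciously long (a likely loop). */
static const char *
resolve_coding_system (const char *name)
{
  int depth;

  for (depth = 0; depth < 8; depth++)
    {
      const char *target = lookup_alias (name);
      if (target == NULL)
        return name;
      name = target;
    }
  return NULL;
}

/* Repoint an existing alias.  Returns 1 on success, 0 if ALIAS is
   not in the table. */
static int
set_alias_target (const char *alias, const char *target)
{
  size_t i;

  for (i = 0; i < ALIAS_COUNT; i++)
    if (strcmp (alias_table[i].alias, alias) == 0)
      {
        alias_table[i].target = target;
        return 1;
      }
  return 0;
}
```

Code that always asks for "file-name" keeps working unchanged after a
call such as set_alias_target ("file-name", "utf-8"); only the result
of the resolution differs.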

The choice of which coding system to use in a particular conversion
macro requires some thought.  In general, you should choose a
lower-level actual coding system when the very design of the APIs you
are working with calls for that particular coding system.  In all other
cases, you should find the least general abstract coding system
(i.e. coding system alias) that applies to your specific situation.
Only use the most general coding systems, such as native, when there
is simply nothing else that is more appropriate.  By doing things this
way, you allow the user more control over how the encoding actually
works, because the user is free to map the abstracted coding system
names onto different actual coding systems.

Some common coding systems are:

 - ctext: Compound Text, the standard encoding under X Windows, which
   is used for clipboard data and possibly other data.
   (ctext is a coding system of type ISO2022.)

 - mswindows-unicode: this is used for representing text passed to MS
   Windows API calls with arguments that need to be in Unicode format.
   (mswindows-unicode is a coding system of type UTF-16.)

 - mswindows-multi-byte: this is used for representing text passed to
   MS Windows API calls with arguments that need to be in multi-byte
   format.  Note that there are very few if any examples of such
   calls.

 - mswindows-tstr: this is used for representing text passed to any MS
   Windows API calls that declare their argument as LPTSTR, or
   LPCTSTR.  This is the vast majority of system calls and
   automatically translates either to mswindows-unicode or
   mswindows-multi-byte, depending on the presence or absence of the
   UNICODE preprocessor constant.  (If we compile XEmacs with this
   preprocessor constant, then all API calls use Unicode for all text
   passed to or received from these API calls.)

 - terminal: used for text sent to or read from a text terminal in the
   absence of a more specific coding system (calls to window-system
   specific APIs should use the appropriate window-specific coding
   system if it makes sense to do so.)

 - file-name: used when specifying the names of files in the absence of
   a more specific encoding, such as mswindows-tstr.

 - native: the most general coding system for specifying text passed to
   system calls.  This generally translates to whatever coding system
   is specified by the current locale.  This should only be used when
   none of the coding systems mentioned above are appropriate.
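
The compile-time choice described for mswindows-tstr can be sketched as
follows.  The macro TSTR_CODING_SYSTEM_NAME and the function
tstr_coding_system_name are hypothetical names used only for this
illustration; the only thing taken from the description above is that
the presence or absence of the UNICODE preprocessor constant decides
which of the two coding systems is meant.

```c
/* Hypothetical sketch: select the coding system that "mswindows-tstr"
   stands for, based on the UNICODE preprocessor constant. */
#ifdef UNICODE
#define TSTR_CODING_SYSTEM_NAME "mswindows-unicode"
#else
#define TSTR_CODING_SYSTEM_NAME "mswindows-multi-byte"
#endif

/* Name of the coding system to use for LPTSTR / LPCTSTR arguments
   in this build. */
static const char *
tstr_coding_system_name (void)
{
  return TSTR_CODING_SYSTEM_NAME;
}
```

Because the decision is made once, at compile time, calling code can
refer to mswindows-tstr everywhere and never needs its own #ifdef.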

PROPER DISPLAY OF MULTILINGUAL TEXT
===================================

There are two things required to get this working correctly.  One is
selecting the correct font, and the other is encoding the text
according to the encoding used for that specific font, or the
window-system specific text display API.  Generally each separate
character set has a different font associated with it, which is
specified by name, and each font has an associated encoding into which
the characters must be translated.  (This is the case on X Windows, at
least; on Windows there is a more general mechanism.)  Both the
specific font for a charset and the encoding of that font are system
dependent.  Currently there is a way of specifying these two
properties under X Windows (using the registry and ccl properties of a
character set) but not for other window systems.  A more general
system needs to be implemented to allow these characteristics to be
specified for all window systems.

Another issue is making sure that the necessary fonts for displaying
various character sets are installed on the system.  Currently, XEmacs
provides, on its web site, X Windows fonts for a number of different
character sets that can be installed by users.  This isn't done yet
for Windows, but it should be.

INPUTTING OF MULTILINGUAL TEXT
==============================

This is a rather complicated issue because there are many paradigms
defined for inputting multi-lingual text, some of which are specific
to particular languages, and any particular language may have many
different paradigms defined for inputting its text.  These paradigms
are encoded in input methods and there is a standard API for defining
an input method in XEmacs called LEIM, or Library of Emacs Input
Methods.  Some of these input methods are written entirely in Elisp,
and thus are system-independent, while others require the aid either
of an external process, or of C level support that ties into a
particular system-specific input method API, for example, XIM under X
Windows, or the active keyboard layout and IME support under Windows.
Currently, there is no support for any system-specific input methods
under Microsoft Windows, although this will change.