Introduction to Internationalization Issues in the Win32 API

Abstract: This page provides an overview of the aspects of the Win32
internationalization API that are relevant to XEmacs, including the
basic distinction between multibyte and Unicode encodings. Also
included are pointers to how XEmacs should make use of this API.

The Win32 API is quite well designed in its handling of strings
encoded for various character sets. The API is geared around the idea
that two different methods of encoding strings should be supported.
These methods are called multibyte and Unicode. The multibyte
encoding is compatible with ASCII strings and is a more efficient
representation for strings containing primarily ASCII characters, but
it has a number of serious deficiencies and limitations: strings in
this encoding are difficult and error-prone to work with, and any
particular string in a multibyte encoding can contain characters from
only a very limited number of character sets. The Unicode encoding
rectifies all of these deficiencies, but it is not compatible with
ASCII strings (in other words, an existing program will not be able
to handle the encoded strings unless it is explicitly modified to do
so), and it takes up twice as much memory as a multibyte encoding
when encoding a purely ASCII string.

Multibyte encodings use a variable number of bytes (either one or two)
to represent characters. ASCII characters are represented by a
single byte with its high bit not set, and non-ASCII characters are
represented by one or two bytes, the first of which always has its
high bit set. (The second byte, when it exists, may or may not have
its high bit set.) There is no single multibyte encoding. Instead,
there is generally one encoding per non-ASCII character set. Such an
encoding is capable of representing (besides ASCII characters, of
course) only characters from one (or possibly two) particular
character sets.

Multibyte encoding makes processing of strings very difficult. For
example, given a pointer to the beginning of a character within a
string, finding the pointer to the beginning of the previous character
may require backing up all the way to the beginning of the string, and
then moving forward. Also, an operation such as separating out the
components of a path by searching for backslashes will fail if it's
implemented in the simplest (but not multibyte-aware) fashion, because
it may find what appears to be a backslash, but which is actually the
second byte of a two-byte character. Finally, the limited number of
character sets that any particular multibyte encoding can represent
means that loss of data is likely if a string is converted from the
XEmacs internal format into a multibyte format.
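
The backslash problem described above can be sketched in portable C.
This is only an illustration: the function names are hypothetical, and
the lead-byte ranges are hard-coded on the assumption that the string
is in Shift-JIS (Windows code page 932), where the character encoded
as 0x95 0x5C has a second byte equal to the ASCII backslash. Real
Windows code would ask the system instead, for example with
IsDBCSLeadByte().

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: lead-byte ranges of Shift-JIS (code page 932).
   Other double-byte code pages use different ranges. */
int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive search: treats every 0x5C byte as a path separator. */
ptrdiff_t find_backslash_naive(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] == 0x5C)
            return (ptrdiff_t)i;
    return -1;
}

/* Multibyte-aware search: steps over the second byte of every
   two-byte character, so that byte is never taken for a separator. */
ptrdiff_t find_backslash_aware(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (is_lead_byte(s[i]) && i + 1 < len)
            i++;                      /* skip the trail byte */
        else if (s[i] == 0x5C)
            return (ptrdiff_t)i;
    }
    return -1;
}

/* Usage: 'a', one two-byte character (0x95 0x5C), a real backslash, 'b'. */
void demo_backslash_search(void)
{
    const unsigned char path[] = { 'a', 0x95, 0x5C, 0x5C, 'b' };
    assert(find_backslash_naive(path, sizeof path) == 2); /* wrong: mid-character */
    assert(find_backslash_aware(path, sizeof path) == 3); /* the real separator */
}
```

Note that the aware version only works when scanning forward from a
known character boundary, which is exactly the restriction that makes
backing up by one character so awkward.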

For these reasons, the C code in XEmacs should never do any sort of
work with multibyte encoded strings (or with strings in any external
encoding for that matter). Strings should always be maintained in the
internal encoding, which is predictable, and converted to an external
encoding only at the point where the string moves from the XEmacs C
code and enters a system library function. Similarly, when a string is
returned from a system library function, it should be immediately
converted into the internal coding before any operations are done on
it.

Unicode, unlike multibyte encodings, is a fixed-width encoding where
every character is represented using 16 bits. It is also capable of
encoding all the characters from all the character sets in common use
in the world. The predictability and completeness of the Unicode
encoding make it a very good encoding for strings that may contain
characters from many character sets mixed up with each other. At the
same time, of course, it is incompatible with routines that expect
ASCII characters and also incompatible with general string
manipulation routines, which will encounter a great number of what
would appear to be embedded nulls in the string. It also takes twice
as much room to encode strings containing primarily ASCII
characters. This is why XEmacs does not use Unicode or a similar
encoding internally for buffers.
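
The embedded-null problem is easy to demonstrate. The sketch below
uses a hypothetical helper, ascii_to_utf16le(), that writes each ASCII
character as a 16-bit unit with the low byte first (an assumption that
matches the byte order used on x86 Windows). A byte-oriented routine
such as strlen() then stops at the zero high byte of the very first
character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode an ASCII string as 16-bit units, low
   byte first, including a 16-bit terminator. Returns the number of
   bytes written, or 0 if the output buffer is too small. */
size_t ascii_to_utf16le(const char *src, unsigned char *dst, size_t dst_size)
{
    size_t n = strlen(src);
    size_t needed = 2 * (n + 1);
    if (needed > dst_size)
        return 0;
    for (size_t i = 0; i <= n; i++) {   /* <= copies the terminator too */
        dst[2 * i] = (unsigned char)src[i];
        dst[2 * i + 1] = 0;             /* high byte is zero for ASCII */
    }
    return needed;
}

void demo_embedded_nulls(void)
{
    unsigned char wide[16];
    size_t bytes = ascii_to_utf16le("abc", wide, sizeof wide);
    assert(bytes == 8);                       /* 4 units of 2 bytes each */
    assert(strlen((const char *)wide) == 1);  /* stops after the 'a' byte */
}
```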

The Win32 API cleverly deals with the issue of 8 bit vs. 16 bit
characters by declaring a type called TCHAR which specifies a generic
character, either 8 bits or 16 bits. Generally TCHAR is defined to be
the same as the simple C type char, unless the preprocessor constant
UNICODE is defined, in which case TCHAR is defined to be WCHAR, which
is a 16 bit type. Nearly all functions in the Win32 API that take
strings are defined to take strings that are actually arrays of
TCHARs. There is a type LPTSTR which is defined to be a string of
TCHARs and another type LPCTSTR which is a const string of TCHARs. The
theory is that any program that uses TCHARs exclusively to represent
characters and does not make assumptions about the size of a TCHAR or
the way that the characters are encoded should work transparently
regardless of whether the UNICODE preprocessor constant is defined,
which is to say, regardless of whether 8 bit multibyte or 16 bit
Unicode characters are being used.

This is implemented by having every Win32 API function that takes a
string as an argument map to one of two functions, suffixed with an A
(which stands for ANSI, and means multibyte strings) or a W (which
stands for wide, and means Unicode strings). The mapping is, of
course, controlled by the same UNICODE preprocessor constant.
Similarly, structures containing strings generally map to one of two
different kinds of structures, with either an A or a W suffix after
the structure name.
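
The mechanism can be imitated in a few lines of portable C. Everything
named DEMO_ or demo_ below is a hypothetical stand-in, not a Win32
name, and the sketch uses the standard wchar_t where Win32 uses its
own 16-bit WCHAR (wchar_t is typically wider than 16 bits on Unix). A
DEMO_UNICODE constant plays the role of UNICODE.

```c
#include <assert.h>
#include <stddef.h>

#ifdef DEMO_UNICODE
typedef wchar_t DEMO_TCHAR;             /* generic character is wide */
#define demo_length demo_lengthW        /* generic name maps to W variant */
#else
typedef char DEMO_TCHAR;                /* generic character is 8 bits */
#define demo_length demo_lengthA        /* generic name maps to A variant */
#endif

/* "A" variant: counts 8-bit units before the terminator. */
size_t demo_lengthA(const char *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* "W" variant: counts wide units before the terminator. */
size_t demo_lengthW(const wchar_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Written only in terms of DEMO_TCHAR and the generic name, so it
   compiles unchanged whether or not DEMO_UNICODE is defined. */
size_t demo_buffer_bytes(const DEMO_TCHAR *s)
{
    return (demo_length(s) + 1) * sizeof(DEMO_TCHAR);
}

void demo_generic_mapping(void)
{
    const DEMO_TCHAR s[] = { 'a', 'b', 'c', 0 };
    assert(demo_length(s) == 3);
    assert(demo_buffer_bytes(s) == 4 * sizeof(DEMO_TCHAR));
}
```

The important habit is the one shown in demo_buffer_bytes(): sizes are
computed with sizeof(DEMO_TCHAR) rather than by assuming one byte per
character.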

Unfortunately, not all of the implementations of the Win32 API
implement all of the functionality described above. In particular,
Windows 95 does not implement very much Unicode functionality. It does
implement functions to convert multibyte-encoded strings to and from
Unicode strings, and provides Unicode versions of certain low-level
functions like ExtTextOut(). All of the rest of the Unicode versions
of API functions, however, are just stubs that return an
error. Conversely, all versions of Windows NT completely implement all
the Unicode functionality, but some versions (especially versions
before Windows NT 4.0) don't implement much of the multibyte
functionality. For this reason, as well as for general code
cleanliness, XEmacs needs to be written in such a way that it works
with or without the UNICODE preprocessor constant being defined.

Getting XEmacs to run when all strings are Unicode primarily involves
removing any assumptions made about the size of characters. Recall
from earlier that conversion between internally and externally
encoded strings should occur at the point of entry into or exit from
a library function. With this in mind, an
externally encoded string in XEmacs can be treated simply as an
arbitrary sequence of bytes of some length which has no particular
relationship to the length of the string in the internal encoding.

To facilitate this, the enum external_data_format, which is declared
in lisp.h, is expanded to contain three new formats, which are
FORMAT_LOCALE, FORMAT_UNICODE and FORMAT_TSTR. FORMAT_LOCALE always
causes encoding into a multibyte string consistent with the encoding
of the current locale. The functions to handle locales are different
under Unix and Windows, and locales are a process property under Unix
and a thread property under Windows, but the concepts are basically
the same. FORMAT_UNICODE of course causes encoding into Unicode, and
FORMAT_TSTR logically maps to either FORMAT_LOCALE or FORMAT_UNICODE
depending on the UNICODE preprocessor constant.
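
As a rough sketch of how the three formats relate, the code below
resolves a TSTR-style format at compile time and produces an external
string as a plain byte buffer with its own length. This is not the
actual XEmacs code: the demo_ names are hypothetical, the internal
text is assumed to be pure ASCII (so the locale encoding is the
identity), and Unicode output is assumed to be 16-bit units with the
low byte first.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the three formats discussed above. */
enum demo_format {
    DEMO_FORMAT_LOCALE,
    DEMO_FORMAT_UNICODE,
    DEMO_FORMAT_TSTR
};

/* TSTR is not an encoding of its own: it resolves to one of the
   other two, depending on the UNICODE preprocessor constant. */
enum demo_format demo_resolve_format(enum demo_format f)
{
    if (f != DEMO_FORMAT_TSTR)
        return f;
#ifdef UNICODE
    return DEMO_FORMAT_UNICODE;
#else
    return DEMO_FORMAT_LOCALE;
#endif
}

/* Convert ASCII-only internal text to external bytes. Returns the
   number of bytes written (no terminator is added), or (size_t)-1 if
   the output buffer is too small. */
size_t demo_to_external(const char *internal, enum demo_format f,
                        unsigned char *out, size_t out_size)
{
    size_t n = strlen(internal);
    size_t width = demo_resolve_format(f) == DEMO_FORMAT_UNICODE ? 2 : 1;
    if (n * width > out_size)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        out[i * width] = (unsigned char)internal[i];
        if (width == 2)
            out[i * width + 1] = 0;   /* high byte of the 16-bit unit */
    }
    return n * width;
}

void demo_external_formats(void)
{
    unsigned char buf[32];
    assert(demo_to_external("abc", DEMO_FORMAT_LOCALE, buf, sizeof buf) == 3);
    assert(demo_to_external("abc", DEMO_FORMAT_UNICODE, buf, sizeof buf) == 6);
    assert(demo_resolve_format(DEMO_FORMAT_TSTR) != DEMO_FORMAT_TSTR);
}
```

The caller only ever sees a byte count, which is the point: the
external length (3 or 6 here) says nothing reliable about the length
of the string in the internal encoding.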

Under Unix, the behavior of FORMAT_TSTR is undefined, and this
particular format should not be used. Under Windows, however,
FORMAT_TSTR should be used for pretty much all of the Win32 API
calls. The other two formats should only be used in particular APIs
that specifically call for a multibyte or Unicode encoded string
regardless of the UNICODE preprocessor constant. String constants that
are to be passed directly to Win32 API functions, such as the names of
window classes, need to be bracketed in their definition with a call
to the macro TEXT. This awfully named macro, which comes out of the
Win32 API, produces a string of either regular or wide chars as
appropriate: the string literal is prefixed with an L (making it a
wide string) when the UNICODE preprocessor constant is defined, and
is left alone otherwise.
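
A portable imitation of the macro shows what the bracketing buys. The
DEMO_ names and the class name "DemoFrame" are hypothetical; only the
token-pasting trick (an L glued onto the literal) mirrors what the
Win32 headers do. The two-level definition is there so that an
argument which is itself a macro gets expanded before the L is pasted
on.

```c
#include <assert.h>
#include <stddef.h>

#ifdef UNICODE
typedef wchar_t DEMO_TCHAR;
#define DEMO_TEXT_(s) L##s      /* paste an L prefix: wide literal */
#else
typedef char DEMO_TCHAR;
#define DEMO_TEXT_(s) s         /* leave the literal untouched */
#endif
#define DEMO_TEXT(s) DEMO_TEXT_(s)

/* A string constant of the kind that would be passed directly to an
   API function, such as a window class name (hypothetical name). */
const DEMO_TCHAR demo_class_name[] = DEMO_TEXT("DemoFrame");

/* Number of characters in the constant, not counting the terminator. */
size_t demo_class_name_units(void)
{
    return sizeof demo_class_name / sizeof demo_class_name[0] - 1;
}

void demo_text_macro(void)
{
    assert(demo_class_name_units() == 9);
    assert(sizeof demo_class_name == 10 * sizeof(DEMO_TCHAR));
}
```

Without the macro, a bare "DemoFrame" literal would be an array of
char in both modes and would fail to match a wide-character parameter
when UNICODE is defined.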

By the way, if you're wondering what happened to FORMAT_OS, I think
that this format should go away entirely because it is too vague and
should be replaced by more specific formats as they are defined.