Internals Manual - [there are many people who work on the XEmacs C code, and in fact new ones come and go periodically. For this reason we need an internals manual that documents how the internal structure of XEmacs, particularly on the C side works. Such an introduction was missing for Mule, although there were already sections describing particular aspects of the internationalization features of XEmacs. I wrote this document specifically as a broad introduction to the Internationalization (aka. Mule) issues that any coder working on XEmacs is likely to encounter.] Introduction to Mule Issues In XEmacs, Mule is a code word for the support for input handling and display of multi-lingual text. This section provides an overview of how this support impacts the C and Lisp code in XEmacs. It is important for anyone who works on the C or the Lisp code, especially on the C code, to be aware of these issues, even if they don't work directly on code that implements multi-lingual features, because there are various general procedures that need to be followed in order to write Mule-compliant code. (The specifics of these procedures are documented elsewhere in this manual.) There are four primary aspects of Mule support: 1) internal handling and representation of multi-lingual text. 2) conversion between the internal representation of text and the various external representations in which multi-lingual text is encoded, such as Unicode representations (including mostly fixed width encodings such as UCS-2/UTF-16 and UCS-4 and a variable width ASCII compliant encodings, such as UTF-7 and UTF-8); the various ISO2022 representations, which typically use escape sequences to switch between different character sets (such as Compound Text, used under X Windows, and JIS and EUC, used specifically for encoding Japanese); Microsoft's multi-byte encodings (such as Shift-JIS); various simple encodings for particular 8-bit character sets (such as Latin-1 and Latin-2, and encodings (such as koi8 and Alternativny) for Cyrillic); and others. This conversion needs to happen both for text in files and text sent to or retrieved from system API calls. It even needs to happen for external binary data because the internal representation does not represent binary data simply as a sequence of bytes as it is represented externally. 3) Proper display of multi-lingual characters. 4) Input of multi-lingual text using the keyboard. These four aspects are for the most part independent of each other. INTERNAL REPRESENTATION OF TEXT =============================== In an ASCII world, life is very simple. There are 256 characters, and each character is represented using the numbers 0 through 255, which fit into a single byte. In the multi-lingual world, however, it is much more complicated. There are a great number of different characters which are organized in a complex fashion into various character sets. The representation to use is not obvious because there are issues of size versus speed to consider. In fact, there are in general two kinds of representations to work with: one that represents a single character using an integer (possibly a byte), and the other representing a single character as a sequence of bytes. The former representation is normally called fixed width, and the other variable width. Both representations represent exactly the same characters, and the conversion from one representation to the other is governed by a specific formula (rather than by table lookup) but it may not be simple. Most C code need not, and in fact should not, know the specifics of exactly how the representations work. In fact, the code must not make assumptions about the representations. This means in particular that it must use the proper macros for retrieving the character at a particular memory location, determining how many characters are present in a particular stretch of text, and incrementing a pointer to a particular character to point to the following character, and so on. It must not assume that one character is stored using one byte, or even using any particular number of bytes. It must not assume that the number of characters in a stretch of text bears any particular relation to a number of bytes in that stretch. It must not assume that the character at a particular memory location can be retrieved simply by dereferencing the memory location, even if a character is known to be ASCII or is being compared with an ASCII character, etc. Careful coding is required to be Mule clean. The biggest work of adding Mule support, in fact, is converting all of the existing code to be Mule clean. Lisp code is mostly unaffected by these concerns. Text in strings and buffers appears simply as a sequence of characters regardless of whether Mule support is present. The biggest difference between older version of Emacs and between current versions of FSF Emacs is that integers and characters are no longer equivalent, but are separate Lisp Object types. CONVERSION BETWEEN INTERNAL AND EXTERNAL REPRESENTATIONS ======================================================== All text needs to be converted to an external representation before being sent to a function or file, and all text retrieved from a function of file needs to be converted to the internal representation. This conversion needs to happen as close to the source or destination of the text as possible. No operations should ever be performed on text encoded in an external representation other than simple copying, because no assumptions can reliably be made about the format of this text. You cannot assume, for example, that the end of text is terminated by a null byte. (For example, if the text is Unicode, it will have many null bytes in it.) You cannot find the next "slash" character by searching through the bytes until you find a byte that looks like a "slash" character, because it might actually be the second byte of a Kanji character. Furthermore, all text in the internal representation must be converted, even if it is known to be completely ASCII, because the external representation may not be ASCII compatible (for example, if it is Unicode). The place where C code needs to be the most careful is when calling external API functions. It is easy to forget that all text passed to or retrieved from these functions needs to be converted. This includes text in structures passed to or retrieved from these functions and all text that is passed to a callback function that is called by the system. Macros are provided to perform conversions to or from external text. These macros are called TO_EXTERNAL_FORMAT and TO_INTERNAL_FORMAT respectively. These macros accept input in various forms, for example, Lisp strings, buffers, lstreams, raw data, and can return data in multiple formats, including both malloc()ed and alloca()ed data. The use of alloca()ed data here is particularly important because, in general, the returned data will not be used after making the API call, and as a result, using alloca()ed data provides a very cheap and easy to use method of allocation. These macros take a coding system argument which indicates the nature of the external encoding. A coding system is an object that encapsulates the structures of a particular external encoding and the methods required to convert to and from this encoding. A facility exists to create coding system aliases, which in essence gives a single coding system two different names. It is effectively used in XEmacs to provide a layer of abstraction on top of the actual coding systems. For example, the coding system alias "file-name" points to whichever coding system is currently used for encoding and decoding file names as passed to or retrieved from system calls. In general, the actual encoding will differ from system to system, and also on the particular locale that the user is in. The use of the file-name alias effectively hides that implementation detail on top of that abstract interface layer which provides a unified set of coding systems which are consistent across all operating environments. The choice of which coding system to use in a particular conversion macro requires some thought. In general, you should choose a lower-level actual coding system when the very design of the APIs you are working with call for that particular coding system. In all other cases, you should find the least general abstract coding system (i.e. coding system alias) that applies to your specific situation. Only use the most general coding systems, such as native, when there is simply nothing else that is more appropriate. By doing things this way, you allow the user more control over how the encoding actually works, because the user is free to map the abstracted coding system names onto to different actual coding systems. Some common coding systems are: - ctext: Compound Text, which is the standard encoding under X Windows, which is used for clipboard data and possibly other data. (ctext is a coding system of type ISO2022.) - mswindows-unicode: this is used for representing text passed to MS Window API calls with arguments that need to be in Unicode format. (mswindows-unicode is a coding system of type UTF-16) - ms-windows-multi-byte: this is used for representing text passed to MS Windows API calls with arguments that need to be in multi-byte format. Note that there are very few if any examples of such calls. - mswindows-tstr: this is used for representing text passed to any MS Windows API calls that declare their argument as LPTSTR, or LPCTSTR. This is the vast majority of system calls and automatically translates either to mswindows-unicode or mswindows-multi-byte, depending on the presence or absence of the UNICODE preprocessor constant. (If we compile XEmacs with this preprocessor constant, then all API calls use Unicode for all text passed to or received from these API calls.) - terminal: used for text sent to or read from a text terminal in the absence of a more specific coding system (calls to window-system specific APIs should use the appropriate window-specific coding system if it makes sense to do so.) - file-name: used when specifying the names of files in the absence of a more specific encoding, such as ms-windows-tstr. - native: the most general coding system for specifying text passed to system calls. This generally translates to whatever coding system is specified by the current locale. This should only be used when none of the coding systems mentioned above are appropriate. PROPER DISPLAY OF MULTILINGUAL TEXT =================================== There are two things required to get this working correctly. One is selecting the correct font, and the other is encoding the text according to the encoding used for that specific font, or the window-system specific text display API. Generally each separate character set has a different font associated with it, which is specified by name and each font has an associated encoding into which the characters must be translated. (this is the case on X Windows, at least; on Windows there is a more general mechanism). Both the specific font for a charset and the encoding of that font are system dependent. Currently there is a way of specifying these two properties under X Windows (using the registry and ccl properties of a character set) but not for other window systems. A more general system needs to be implemented to allow these characteristics to be specified for all Windows systems. Another issue is making sure that the necessary fonts for displaying various character sets are installed on the system. Currently, XEmacs provides, on its web site, X Windows fonts for a number of different character sets that can be installed by users. This isn't done yet for Windows, but it should be. INPUTTING OF MULTILINGUAL TEXT ============================== This is a rather complicated issue because there are many paradigms defined for inputting multi-lingual text, some of which are specific to particular languages, and any particular language may have many different paradigms defined for inputting its text. These paradigms are encoded in input methods and there is a standard API for defining an input method in XEmacs called LEIM, or Library of Emacs Input Methods. Some of these input methods are written entirely in Elisp, and thus are system-independent, while others require the aid either of an external process, or of C level support that ties into a particular system-specific input method API, for example, XIM under X Windows, or the active keyboard layout and IME support under Windows. Currently, there is no support for any system-specific input methods under Microsoft Windows, although this will change.