- Support for unified internal representation, i.e. Unicode

  - creation of generic macros for accessing internally formatted data.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I have a design; it's all written down (I did it in Tsukuba), and I just have
to have it transcribed.  It's higher level than the macros, though; it's Lisp
primitives that I'm designing.

As for the design of the macros, don't worry so much about all files having to
get included (which is inevitable with macros), but about how the files are
separated.  Your design might go like this:

1. You have generic macro interfaces, which specify a particular
   behavior but not an implementation.  These generic macros have
   complementary versions for buffers and for strings (and the buffer
   or string is an argument to all of the macros), and do such things
   as convert between byte and char indices, retrieve the character at
   a particular byte or char index, increment or decrement a byte
   index to the beginning of the next or previous character, indicate
   the number of bytes occupied by the character at a particular byte
   or character index, etc.  These are similar to what's already out
   there except that they confound buffers and strings and that they
   can also work with actual char *'s, which I think is a really bad
   idea because it encourages code to "assume" that the representation
   is ASCII compatible, which it might not be (e.g. 16-bit fixed
   width).  In fact, one thing I'm planning on doing is redefining
   Bufbyte as a struct, for debugging purposes, to catch all places
   that cavalierly compare them with ASCII char's.  Note also that I
   really want to rename Bufpos and Bytind, which are confusing and
   wrong in that they also apply to strings. They should be Bytepos
   and Charpos, or something like that, to go along with Bytecount and
   Charcount. Similarly, Bufbyte is a misnomer and should be
   Intbyte -- a byte in the internal string representation (any of the
   internal representations) of a string or buffer.  Corresponding to
   this is Extbyte (which we already have), a byte in any external
   string representation.  We also have Extcount, which makes sense,
   and we might possibly want Extcharcount, the number of characters
   in an external string representation; but that gets sticky in modal
   encodings, and it's not clear how useful it would be.

2. For all generic macro interfaces, there are specific versions of
   each of them for each possible representation (pure ASCII in the
   non-Mule world, Mule standard, UTF-8, 8-bit fixed, 16-bit fixed,
   32-bit fixed, etc.; there may well be more than one possible 16-bit
   fixed version, as well). Each representation has a corresponding
   prefix, e.g. MULE_ or FIXED16_ or whatever, which is prefixed onto
   the generic macro names.  The resulting macros perform the
   operation defined for the macro, but assume, and only work
   correctly with, text in the corresponding representation.

3. The definition of the generic versions merely conditionalizes on
   the appropriate things (i.e. bit flags in the buffer or string
   object) and calls the appropriate representation-specific version.
   There may be more than one definition (protected by ifdefs, of
   course), or one definition that is amalgamated out of many ifdef'ed
   sections.

4. You should probably put each different representation in its own
   header file, e.g. charset-mule.h or charset-fixed16.h or
   charset-ascii.h or whatever.  Then put the main macros into
   charset.h, and conditionalize in this file appropriately to include
   the other ones.  That way, code that actually needs to play around
   with internal-format text at this level can include "charset.h"
   (certainly a much better place than buffer.h), and everyone else
   uses higher-level routines.  The representation-specific macros
   should not normally be used *directly* at all; they are invoked
   automatically from the generic macros.  However, code that needs to
   be highly, highly optimized might choose to take a loop and write
   two versions of it, one for each representation, to avoid the
   per-loop-iteration cost of a comparison. Until the macro interface
   is rock stable and solid, we should strongly discourage such
   nanosecond optimizations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
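
A minimal sketch of the layering described in points 1-3 above, in C.  All
names here (text_object, REP_UTF8, REP_FIXED16, CHAR_LENGTH_AT, INC_BYTEPOS,
count_chars) are hypothetical and chosen only for illustration; the sketch
also collapses the separate buffer and string versions into one object type
and shows only two representations.  It is meant to show the shape of the
generic-to-specific dispatch, not an actual implementation.

```c
#include <stddef.h>

/* Hypothetical types, following the naming proposed in the text. */
typedef unsigned char Intbyte;   /* a byte in some internal representation */
typedef ptrdiff_t Bytecount;

/* Assumption: the object carries a flag naming its representation. */
enum text_rep { REP_UTF8, REP_FIXED16 };

struct text_object {
    enum text_rep rep;
    const Intbyte *data;
    Bytecount len;               /* length in bytes */
};

/* Helper for the UTF-8 version: bytes in the character that starts
   with LEAD (assumes LEAD really is a leading byte). */
static Bytecount utf8_char_length(Intbyte lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Representation-specific versions: the generic name with a prefix.
   Each is only correct for text in its own representation. */
#define UTF8_CHAR_LENGTH_AT(obj, bytepos) \
    utf8_char_length((obj)->data[bytepos])
#define FIXED16_CHAR_LENGTH_AT(obj, bytepos) ((Bytecount) 2)

/* Generic version: conditionalizes on the flag and calls the
   appropriate representation-specific version. */
#define CHAR_LENGTH_AT(obj, bytepos)                      \
    ((obj)->rep == REP_FIXED16                            \
     ? FIXED16_CHAR_LENGTH_AT(obj, bytepos)               \
     : UTF8_CHAR_LENGTH_AT(obj, bytepos))

/* Advance a byte index to the beginning of the next character. */
#define INC_BYTEPOS(obj, bytepos) \
    ((bytepos) += CHAR_LENGTH_AT(obj, bytepos))

/* Example caller: uses only the generic macros, so it works for
   either representation without knowing which one it has. */
static ptrdiff_t count_chars(const struct text_object *obj)
{
    Bytecount pos = 0;
    ptrdiff_t n = 0;
    while (pos < obj->len) {
        INC_BYTEPOS(obj, pos);
        n++;
    }
    return n;
}
```

Note that count_chars pays for one comparison per iteration inside
CHAR_LENGTH_AT; that is the per-loop-iteration cost point 4 refers to, and
the reason a highly optimized caller might eventually want one loop per
representation.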

  - UTF-16 compatible representation

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
NOTE: One possible default internal representation that is compatible
with UTF-16 but allows all possible chars in UCS-4 would take a
more-or-less unused range of 2048 chars (not from the private area,
because Microsoft actually uses up most or all of it with EUDC chars)
and use it as a second set of leading units.  Let's say we picked
A400 - ABFF.  Then, we'd have:

0000 - FFFF    Simple chars

D[8-B]xx D[C-F]xx  Surrogate char, represents 1M chars

A[4-B]xx D[C-F]xx D[C-F]xx   Surrogate char, represents 2G chars

This is exactly the same number of chars as UCS-4 handles, and it has the
same properties as UTF-8 and Mule-internal:

1. There are two disjoint groupings of units, one representing leading units
   and one representing non-leading units.
2. Given a leading unit, you immediately know how many units follow to make
   up a valid char, irrespective of any other context.

Note that A4xx is actually currently assigned to Yi.  Since this is an
internal representation, we could just move these elsewhere.

An alternative is to pick two disjoint ranges, e.g. 2D00 - 2DFF and
A500 - ABFF.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
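
A rough sketch of how the two properties above play out for the A400 - ABFF
variant, in C.  The function names (xutf16_char_units, xutf16_decode) are
hypothetical.  The two-unit case is ordinary UTF-16 surrogate decoding; the
mapping of the three-unit form onto a 31-bit value is an assumption made
here for illustration, since the note only gives the unit ranges and not
the arithmetic.

```c
#include <stdint.h>

/* Number of 16-bit units in the character that starts with LEAD,
   or 0 if LEAD is a non-leading unit (DC00 - DFFF).  The answer
   depends on LEAD alone, with no other context. */
static int xutf16_char_units(uint16_t lead)
{
    if (lead >= 0xDC00 && lead <= 0xDFFF) return 0;  /* non-leading */
    if (lead >= 0xD800 && lead <= 0xDBFF) return 2;  /* 1M chars    */
    if (lead >= 0xA400 && lead <= 0xABFF) return 3;  /* 2G chars    */
    return 1;                                        /* simple char */
}

/* Decode the character starting at P.  Returns 0xFFFFFFFF if P does
   not point at a leading unit.  Trailing units are not validated. */
static uint32_t xutf16_decode(const uint16_t *p)
{
    uint16_t lead = p[0];

    switch (xutf16_char_units(lead)) {
    case 1:
        return lead;
    case 2:
        /* Standard UTF-16: 10 bits from each unit, offset by 0x10000. */
        return 0x10000u
            + (((uint32_t) (lead - 0xD800) << 10)
               | (uint32_t) (p[1] - 0xDC00));
    case 3:
        /* Assumed mapping: 11 bits from the leading unit plus 10 bits
           from each trailing unit, giving a raw 31-bit value. */
        return ((uint32_t) (lead - 0xA400) << 20)
            | ((uint32_t) (p[1] - 0xDC00) << 10)
            | (uint32_t) (p[2] - 0xDC00);
    default:
        return 0xFFFFFFFFu;
    }
}
```

With this assumed mapping the three-unit form carries 11 + 10 + 10 = 31
bits, which is where the 2G figure comes from.  Using two disjoint leading
ranges instead would only change the range test in xutf16_char_units and
the computation of the 11-bit leading value.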

  - merging of UTF-2000 code

  - will support language tagging using text properties