- Support for unified internal representation, i.e. Unicode
- creation of generic macros for accessing internally formatted data.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I have a design; it's all written down (I did it in Tsukuba), and I
just have to have it transcribed. It's higher level than the macros,
though; it's Lisp primitives that I'm designing.

As for the design of the macros, don't worry so much about all files
having to get included (which is inevitable with macros), but about how
the files are separated. Your design might go like this:

1. You have generic macro interfaces, which specify a particular
   behavior but not an implementation. These generic macros have
   complementary versions for buffers and for strings (and the buffer
   or string is an argument to all of the macros), and do such things
   as convert between byte and char indices, retrieve the character at
   a particular byte or char index, increment or decrement a byte index
   to the beginning of the next or previous character, indicate the
   number of bytes occupied by the character at a particular byte or
   character index, etc. These are similar to what's already out there,
   except that the existing macros confound buffers and strings and can
   also work with actual char *'s, which I think is a really bad idea
   because it encourages code to "assume" that the representation is
   ASCII compatible, which it might not be (e.g. 16-bit fixed width).
   In fact, one thing I'm planning on doing is redefining Bufbyte as a
   struct, for debugging purposes, to catch all places that cavalierly
   compare them with ASCII chars.

   Note also that I really want to rename Bufpos and Bytind, which are
   confusing and wrong in that they also apply to strings. They should
   be Bytepos and Charpos, or something like that, to go along with
   Bytecount and Charcount.
   Similarly, Bufbyte is a misnomer and should be Intbyte -- a byte in
   the internal string representation (any of the internal
   representations) of a string or buffer. Corresponding to this is
   Extbyte (which we already have), a byte in any external string
   representation. We also have Extcount, which makes sense, and we
   might possibly want Extcharcount, the number of characters in an
   external string representation; but that gets sticky in modal
   encodings, and it's not clear how useful it would be.

2. For all generic macro interfaces, there are specific versions of
   each of them for each possible representation (pure ASCII in the
   non-Mule world, Mule standard, UTF-8, 8-bit fixed, 16-bit fixed,
   32-bit fixed, etc.; there may well be more than one possible 16-bit
   fixed version, as well). Each representation has a corresponding
   prefix, e.g. MULE_ or FIXED16_ or whatever, which is prefixed onto
   the generic macro names. The resulting macros perform the operation
   defined for the macro, but assume, and only work correctly with,
   text in the corresponding representation.

3. The definition of the generic versions merely conditionalizes on the
   appropriate things (i.e. bit flags in the buffer or string object)
   and calls the appropriate representation-specific version. There may
   be more than one definition (protected by ifdefs, of course), or one
   definition that is amalgamated out of many ifdef'ed sections.

4. You should probably put each different representation in its own
   header file, e.g. charset-mule.h or charset-fixed16.h or
   charset-ascii.h or whatever. Then put the main macros into
   charset.h, and conditionalize in this file appropriately to include
   the other ones. That way, code that actually needs to play around
   with internal-format text at this level can include "charset.h"
   (certainly a much better place than buffer.h), and everyone else
   uses higher-level routines.
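
The layering in steps 1 to 3 above might look roughly like the C sketch
below. It is only an illustration under stated assumptions: every
identifier in it (struct text_string, the REP_ flags, STRING_CHAR_BYTES,
STRING_INC_BYTEPOS, the ASCII_/UTF8_/FIXED16_ prefixed versions, and
string_bytepos_to_charpos) is hypothetical and not an existing macro,
and everything is shown in one file rather than split across the
per-representation headers of step 4.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; they only illustrate the layering. */
typedef unsigned char Intbyte;  /* a byte of internal-format text */
typedef ptrdiff_t Bytecount;    /* a count of, or index in, bytes */
typedef ptrdiff_t Charcount;    /* a count of, or index in, characters */

enum text_rep { REP_ASCII, REP_UTF8, REP_FIXED16 };

/* Stand-in for a string object that carries a representation flag. */
struct text_string
{
  enum text_rep rep;
  const Intbyte *data;
  Bytecount len;                /* length in bytes */
};

/* Helper for the UTF-8 version: bytes in a char, given its first byte. */
static int
utf8_bytes_from_lead (Intbyte lead)
{
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  return 4;
}

/* Step 2: representation-specific versions.  Each one is only correct
   for text in its own representation. */
#define ASCII_STRING_CHAR_BYTES(s, pos)   1
#define FIXED16_STRING_CHAR_BYTES(s, pos) 2
#define UTF8_STRING_CHAR_BYTES(s, pos) \
  utf8_bytes_from_lead ((s)->data[pos])

/* Step 3: the generic version tests the flag in the string object and
   calls the matching representation-specific version. */
#define STRING_CHAR_BYTES(s, pos)                               \
  ((s)->rep == REP_ASCII ? ASCII_STRING_CHAR_BYTES (s, pos)     \
   : (s)->rep == REP_UTF8 ? UTF8_STRING_CHAR_BYTES (s, pos)     \
   : FIXED16_STRING_CHAR_BYTES (s, pos))

/* Advance a byte index to the beginning of the next character. */
#define STRING_INC_BYTEPOS(s, pos) \
  ((pos) += STRING_CHAR_BYTES (s, pos))

/* Convert a byte index into a char index by walking the text with the
   generic macros only, so the loop is representation-independent. */
static Charcount
string_bytepos_to_charpos (const struct text_string *s, Bytecount bytepos)
{
  Bytecount pos = 0;
  Charcount n = 0;

  assert (bytepos >= 0 && bytepos <= s->len);
  while (pos < bytepos)
    {
      STRING_INC_BYTEPOS (s, pos);
      n++;
    }
  return n;
}
```

Callers in this sketch only ever name the generic macros and pass the
string object itself, never a bare char *, so nothing in the calling
code can assume an ASCII-compatible representation.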

The representation-specific macros should not normally be used
*directly* at all; they are invoked automatically from the generic
macros. However, code that needs to be highly, highly optimized might
choose to take a loop and write two versions of it, one for each
representation, to avoid the per-loop-iteration cost of a comparison.
Until the macro interface is rock stable and solid, we should strongly
discourage such nanosecond optimizations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

- UTF-16 compatible representation

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
NOTE: One possible default internal representation that was compatible
with UTF-16 but allowed all possible chars in UCS-4 would be to take a
more-or-less unused range of 2048 chars (not from the private area
because Microsoft actually uses up most or all of it with EUDC chars).
Let's say we picked A400 - ABFF. Then, we'd have:

0000 - FFFF                     Simple chars
D[8-B]xx D[C-F]xx               Surrogate char, represents 1M chars
A[4-B]xx D[C-F]xx D[C-F]xx      Surrogate char, represents 2G chars

This is exactly the same number of chars as UCS-4 handles, and it
follows the same property as UTF-8 and Mule-internal:

1. There are two disjoint groupings of units, one representing leading
   units and one representing non-leading units.
2. Given a leading unit, you immediately know how many units follow to
   make up a valid char, irrespective of any other context.

Note that A4xx is actually currently assigned to Yi. Since this is an
internal representation, we could just move these elsewhere.

An alternative is to pick two disjoint ranges, e.g. 2D00 - 2DFF and
A500 - ABFF.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

- merging of UTF-2000 code
- will support language tagging using text properties
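
For the UTF-16 compatible representation in the NOTE above, the two
listed properties can be sketched in C as follows. The unit ranges
(A400 - ABFF, D800 - DBFF, DC00 - DFFF) come from the note; everything
else is an assumption made for illustration: the function names, the
11/10/10 bit split of the three-unit form, and the choice to store chars
that collide with the reserved ranges in the three-unit form (one
possible way of "moving them elsewhere").

```c
#include <assert.h>
#include <stdint.h>

/* Leading-unit range borrowed for the three-unit form (from the note). */
#define EXT_LEAD_MIN 0xA400u
#define EXT_LEAD_MAX 0xABFFu

/* Property 2: the leading unit alone tells you how many units make up
   the char.  Returns 0 for a non-leading unit (property 1). */
static int
units_in_char (uint16_t unit)
{
  if (unit >= 0xDC00u && unit <= 0xDFFFu) return 0;
  if (unit >= 0xD800u && unit <= 0xDBFFu) return 2;
  if (unit >= EXT_LEAD_MIN && unit <= EXT_LEAD_MAX) return 3;
  return 1;
}

/* Encode C (below 2^31) into OUT, which must have room for 3 units.
   Returns the number of units written. */
static int
encode_char (uint32_t c, uint16_t *out)
{
  assert (c < 0x80000000u);
  if (c < 0x10000u && units_in_char ((uint16_t) c) == 1)
    {
      out[0] = (uint16_t) c;                    /* simple char */
      return 1;
    }
  if (c >= 0x10000u && c < 0x110000u)
    {
      c -= 0x10000u;                            /* standard UTF-16 pair */
      out[0] = (uint16_t) (0xD800u + (c >> 10));
      out[1] = (uint16_t) (0xDC00u + (c & 0x3FFu));
      return 2;
    }
  /* Assumed layout: 11 bits in the lead unit, then 10 + 10 bits.  This
     also catches values that collide with the reserved unit ranges. */
  out[0] = (uint16_t) (EXT_LEAD_MIN + (c >> 20));
  out[1] = (uint16_t) (0xDC00u + ((c >> 10) & 0x3FFu));
  out[2] = (uint16_t) (0xDC00u + (c & 0x3FFu));
  return 3;
}

/* Decode the char whose leading unit is at P. */
static uint32_t
decode_char (const uint16_t *p)
{
  int n = units_in_char (p[0]);

  assert (n != 0);                              /* must be a leading unit */
  if (n == 1)
    return p[0];
  if (n == 2)
    return 0x10000u + (((uint32_t) (p[0] - 0xD800u) << 10)
                       | (uint32_t) (p[1] - 0xDC00u));
  return ((uint32_t) (p[0] - EXT_LEAD_MIN) << 20)
         | ((uint32_t) (p[1] - 0xDC00u) << 10)
         | (uint32_t) (p[2] - 0xDC00u);
}
```

The 2048 lead values times 1024 times 1024 trailing values give 2^31
combinations, which is the 2G figure in the table above.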