Next: , Up: Internal Mule Encodings   [Contents][Index]


18.3.1 Internal String Encoding

ASCII characters are encoded using their position code directly. Other characters are encoded using their leading byte followed by their position code(s) with the high bit set. Characters in private character sets have their leading byte prefixed with a leading byte prefix, which is either 0x9E or 0x9F. (No character sets are ever assigned these leading bytes.) Specifically:

Character set           Encoding (PC=position-code, LB=leading-byte)
-------------           --------
ASCII                   PC-1 |
Control-1               LB   |  PC1 + 0xA0 |
Dimension-1 official    LB   |  PC1 + 0x80 |
Dimension-1 private     0x9E |  LB         | PC1 + 0x80 |
Dimension-2 official    LB   |  PC1 + 0x80 | PC2 + 0x80 |
Dimension-2 private     0x9F |  LB         | PC1 + 0x80 | PC2 + 0x80

The basic characteristic of this encoding is that the first byte of all characters is in the range 0x00 - 0x9F, and the second and following bytes of all characters is in the range 0xA0 - 0xFF. This means that it is impossible to get out of sync, or more specifically:

  1. Given any byte position, the beginning of the character it is within can be determined in constant time.
  2. Given any byte position at the beginning of a character, the beginning of the next character can be determined in constant time.
  3. Given any byte position at the beginning of a character, the beginning of the previous character can be determined in constant time.
  4. Textual searches can simply treat encoded strings as if they were encoded in a one-byte-per-character fashion rather than the actual multi-byte encoding.

None of the standard non-modal encodings meet all of these conditions. For example, EUC satisfies only (2) and (3), while Shift-JIS and Big5 (not yet described) satisfy only (2). (All non-modal encodings must satisfy (2), in order to be unambiguous.)


Next: , Up: Internal Mule Encodings   [Contents][Index]