

18.1 Character Sets

A character set (or charset) is an ordered set of characters. A particular character in a charset is indexed using one or more position codes, which are non-negative integers. The number of position codes needed to identify a particular character in a charset is called the dimension of the charset. In SXEmacs/Mule, all charsets have dimension 1 or 2, and the size of all charsets (except for a few special cases) is either 94, 96, 94 by 94, or 96 by 96. The range of position codes used to index characters from any of these types of character sets is as follows:

Charset type            Position code 1         Position code 2
------------------------------------------------------------
94                      33 - 126                N/A
96                      32 - 127                N/A
94x94                   33 - 126                33 - 126
96x96                   32 - 127                32 - 127

Note that in the above cases position codes do not start at an expected value such as 0 or 1. The reason for this will become clear later.

For example, Latin-1 is a 96-character charset, and JISX0208 (the Japanese national character set) is a 94x94-character charset.
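The table of charset types above can be sketched as a small lookup in Python. This is only an illustration of the ranges described here, not SXEmacs code: the names `CHARSET_TYPES`, `charset_size`, and `valid_position_codes` are hypothetical.

```python
# Illustrative only: these names are not part of SXEmacs/Mule.
# Each entry records the dimension and the inclusive range that every
# position code of that charset type must fall in (from the table above).
CHARSET_TYPES = {
    "94":    {"dimension": 1, "low": 33, "high": 126},
    "96":    {"dimension": 1, "low": 32, "high": 127},
    "94x94": {"dimension": 2, "low": 33, "high": 126},
    "96x96": {"dimension": 2, "low": 32, "high": 127},
}


def charset_size(charset_type):
    """Return the number of slots in a charset of the given type."""
    info = CHARSET_TYPES[charset_type]
    codes_per_dimension = info["high"] - info["low"] + 1
    return codes_per_dimension ** info["dimension"]


def valid_position_codes(charset_type, codes):
    """Check that `codes` has the right length and every code is in range."""
    info = CHARSET_TYPES[charset_type]
    if len(codes) != info["dimension"]:
        return False
    return all(info["low"] <= code <= info["high"] for code in codes)
```

Under these assumptions, a Latin-1 character is checked with `valid_position_codes("96", (code,))` and a JISX0208 character with `valid_position_codes("94x94", (code1, code2))`. A slot that passes this check may still be empty in a particular charset.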

[Note that, although the ranges above define the valid position codes for a charset, some of the slots in a particular charset may in fact be empty. This is the case for JISX0208, for example, where all the slots whose first position code is in the range 117 - 126 are empty.]

There are three charsets that do not follow the above rules. All of them have one dimension, and have ranges of position codes as follows:

Charset name            Position code 1
------------------------------------
ASCII                   0 - 127
Control-1               0 - 31
Composite               0 - some large number

(The upper bound of the position code for composite characters has not yet been determined, but it will probably be at least 16,383.)

ASCII is the union of two subsidiary character sets: Printing-ASCII (the printing ASCII character set, consisting of position codes 33 - 126, as in a standard 94-character charset) and Control-ASCII (the non-printing characters that would appear in a binary file with codes 0 - 32 and 127).

Control-1 contains the non-printing characters that would appear in a binary file with codes 128 - 159.

Composite contains characters that are generated by overstriking one or more characters from other charsets.

Note that some characters in ASCII, and all characters in Control-1, are control (non-printing) characters. These have no printed representation but instead control some other aspect of the output (e.g. TAB, code 9, moves the current character position to the next tab stop). All other characters in all charsets are graphic (printing) characters.
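The split of ASCII into its two subsidiary sets, and the control/graphic distinction, can be sketched in Python as follows. The function names are hypothetical, and the sketch follows this section in treating code 32 (space) as part of Control-ASCII.

```python
# Illustrative only: these function names are not part of SXEmacs/Mule.

def ascii_subset(code):
    """Return the subsidiary charset that an ASCII position code belongs to."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII position codes run from 0 to 127")
    # Printing-ASCII covers 33 - 126; Control-ASCII covers 0 - 32 and 127.
    return "Printing-ASCII" if 33 <= code <= 126 else "Control-ASCII"


def is_control(charset, code):
    """Return True if the character is a control (non-printing) character."""
    if charset == "Control-1":
        return True  # every Control-1 character is a control character
    if charset == "ASCII":
        return ascii_subset(code) == "Control-ASCII"
    return False  # all other characters in all charsets are graphic
```

For instance, `is_control("ASCII", 9)` is true for TAB, while `is_control("Latin-1", 32)` is false because every Latin-1 slot is graphic.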

When a binary file is read in, the bytes in the file are assigned to character sets as follows:

Bytes           Character set           Range
--------------------------------------------------
0 - 127         ASCII                   0 - 127
128 - 159       Control-1               0 - 31
160 - 255       Latin-1                 32 - 127

This is a bit ad hoc but gets the job done.
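As a minimal sketch of the byte assignment in the table above, the following Python function maps a byte from a binary file to a charset name and position code. The name `classify_byte` is hypothetical and the code only mirrors the table; it is not how SXEmacs itself is implemented.

```python
# Illustrative only: `classify_byte` is not part of SXEmacs/Mule.

def classify_byte(byte):
    """Map a byte (0 - 255) to a (charset name, position code) pair."""
    if not 0 <= byte <= 255:
        raise ValueError("a byte must be in the range 0 - 255")
    if byte <= 127:
        return ("ASCII", byte)            # bytes 0 - 127 keep their value
    if byte <= 159:
        return ("Control-1", byte - 128)  # bytes 128 - 159 map to 0 - 31
    return ("Latin-1", byte - 128)        # bytes 160 - 255 map to 32 - 127
```

For example, byte 233 is assigned to Latin-1 with position code 105, and byte 128 is assigned to Control-1 with position code 0.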

