

8.7.3 Conversion to and from External Data

When an external function, such as a C library function, returns a char pointer, you should almost never treat it as Bufbyte. Such returned strings may contain 8-bit characters that SXEmacs can misinterpret, which can cause a crash. Likewise, when exporting a piece of internal text to the outside world, you should always convert it to an appropriate external encoding, lest the internal stuff (such as the infamous \201 characters) leak out.

The interface for converting between the internal and external representations of text is the set of conversion macros defined in buffer.h. These macros used to support only a fixed set of external formats, but now any coding system can be used with them. The coding system alias mechanism is used to create the following logical coding systems, which replace the fixed external formats. The (dontusethis-set-symbol-value-handler) mechanism was enhanced to make this possible (more work on that is needed, such as removing the dontusethis- prefix).

Qbinary

This is the simplest format and is what we use in the absence of a more appropriate format. This converts according to the binary coding system:

  1. On input, bytes 0–255 are converted into (implicitly Latin-1) characters 0–255. A non-Mule xemacs doesn’t really know about different character sets and the fonts to display them, so the bytes can be treated as text in different 1-byte encodings by simply setting the appropriate fonts. So in a sense, non-Mule xemacs is a multi-lingual editor if, for example, different fonts are used to display text in different buffers, faces, or windows. The specifier mechanism gives the user complete control over this kind of behavior.
  2. On output, characters 0–255 are converted into bytes 0–255 and other characters are converted into ‘~’.
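
The two rules above can be sketched in standalone C. This is only an illustration of the mapping, not the SXEmacs implementation: demo_char, demo_binary_decode, and demo_binary_encode are hypothetical names, and a plain unsigned int stands in for the internal character representation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an internal character code.
   This is NOT the real SXEmacs character type. */
typedef unsigned int demo_char;

/* Input direction: each byte 0-255 becomes the character
   with the same numeric code. */
void
demo_binary_decode (const unsigned char *src, size_t len, demo_char *dst)
{
  size_t i;

  assert (src != NULL && dst != NULL);
  for (i = 0; i < len; i++)
    dst[i] = src[i];
}

/* Output direction: characters 0-255 become the byte with the
   same value; every other character becomes '~'. */
void
demo_binary_encode (const demo_char *src, size_t len, unsigned char *dst)
{
  size_t i;

  assert (src != NULL && dst != NULL);
  for (i = 0; i < len; i++)
    dst[i] = (src[i] <= 255) ? (unsigned char) src[i] : (unsigned char) '~';
}
```

Note that the output direction is lossy: once a character outside 0–255 has been written as ‘~’, the original character cannot be recovered from the external data.
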

Qfile_name

Format used for filenames. This is user-definable via either the file-name-coding-system or pathname-coding-system (now obsolete) variables.

Qnative

Format used for the external Unix environment—argv[], stuff from getenv(), stuff from the /etc/passwd file, etc. Currently this is the same as Qfile_name. The two should be distinguished for clarity and possible future separation.

Qctext

Compound-text format. This is the standard X11 format used for data stored in properties, selections, and the like. This is an 8-bit no-lock-shift ISO2022 coding system. This is a real coding system, unlike Qfile_name, which is user-definable.

There are two fundamental macros to convert between external and internal format.

TO_INTERNAL_FORMAT converts external data to internal format, and TO_EXTERNAL_FORMAT converts the other way around. The arguments each of these receives are a source type, a source, a sink type, a sink, and a coding system (or a symbol naming a coding system).

A typical call looks like

TO_EXTERNAL_FORMAT (LISP_STRING, str, C_STRING_MALLOC, ptr, Qfile_name);

which means that the contents of the Lisp string str are written to a malloc’ed memory area, and ptr points to that area once the conversion has completed. The conversion is done using the file-name coding system, which the user controls indirectly by setting or binding the variable file-name-coding-system.

Some sources and sinks require two C variables to specify. We use some preprocessor magic to allow different source and sink types, and even different numbers of arguments to specify different types of sources and sinks.

So we can have a call that looks like

TO_INTERNAL_FORMAT (DATA, (ptr, len),
                    MALLOC, (ptr, len),
                    coding_system);

The parenthesized argument pairs are required to make the preprocessor magic work.
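
To see why the parentheses matter, here is a much-simplified sketch of one way such dispatch can be written with token pasting. It is not the actual definition from buffer.h; every DEMO_ and demo_ name is hypothetical, and the worker only reports how many bytes a source would supply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical worker: reports how many bytes a source supplies. */
size_t
demo_source_size (const char *ptr, size_t len)
{
  assert (ptr != NULL);
  return len;
}

/* Paste the type name onto a prefix to select a per-type helper macro.
   The macro always takes exactly two arguments: a type and a source. */
#define DEMO_SOURCE_SIZE(type, src) DEMO_SOURCE_##type (src)

/* Two-variable source: `pair' is the parenthesized text `(ptr, len)',
   which becomes the argument list of the worker call. */
#define DEMO_SOURCE_DATA(pair) demo_source_size pair

/* One-variable source: the length is derived with strlen(). */
#define DEMO_SOURCE_C_STRING(ptr) demo_source_size ((ptr), strlen (ptr) + 1)
```

With these definitions, DEMO_SOURCE_SIZE (DATA, (buf, len)) and DEMO_SOURCE_SIZE (C_STRING, buf) both expand to a call of the same worker. Writing DEMO_SOURCE_SIZE (DATA, buf, len) without the inner parentheses would pass three arguments to a two-parameter macro and fail to compile, which is the reason for the pair syntax in this sketch.
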

Here are the different source and sink types:

DATA, (ptr, len),

input data is a fixed buffer of size len at address ptr

ALLOCA, (ptr, len),

output data is placed in an alloca()ed buffer of size len pointed to by ptr

MALLOC, (ptr, len),

output data is in a malloc()ed buffer of size len pointed to by ptr

C_STRING_ALLOCA, ptr,

equivalent to ALLOCA, (ptr, len_ignored) on output

C_STRING_MALLOC, ptr,

equivalent to MALLOC, (ptr, len_ignored) on output

C_STRING, ptr,

equivalent to DATA, (ptr, strlen (ptr) + 1) on input

LISP_STRING, string,

input or output is a Lisp_Object of type string

LISP_BUFFER, buffer,

output is written to (point) in lisp buffer buffer

LISP_LSTREAM, lstream,

input or output is a Lisp_Object of type lstream

LISP_OPAQUE, object,

input or output is a Lisp_Object of type opaque

Often, the data is being converted to a ’\0’-byte-terminated string, which is the format required by many external system C APIs. For these purposes, a source type of C_STRING or a sink type of C_STRING_ALLOCA or C_STRING_MALLOC is appropriate. Otherwise, we should try to keep SXEmacs ’\0’-byte-clean, which means using (ptr, len) pairs.
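
The difference between the two styles shows up as soon as the data contains an embedded ’\0’ byte. The following standalone sketch uses hypothetical helper names and only standard C library calls. It contrasts how much of a buffer a C_STRING-style source covers with what an explicit (ptr, len) pair can describe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes covered by a C_STRING-style source: everything up to and
   including the first '\0', i.e. strlen (ptr) + 1. */
size_t
demo_c_string_extent (const char *ptr)
{
  assert (ptr != NULL);
  return strlen (ptr) + 1;
}

/* Count the '\0' bytes inside an explicit (ptr, len) span.  A span
   described this way may legitimately contain any number of them. */
size_t
demo_count_nul_bytes (const char *ptr, size_t len)
{
  size_t i, count = 0;

  assert (ptr != NULL);
  for (i = 0; i < len; i++)
    if (ptr[i] == '\0')
      count++;
  return count;
}
```

For the 8-byte array "one\0two", demo_c_string_extent sees only the first 4 bytes, while a (ptr, len) pair with len 8 covers the whole array, including both ’\0’ bytes.
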

The sinks to be specified must be lvalues, unless they are the lisp object types LISP_LSTREAM or LISP_BUFFER.

For the sink types ALLOCA and C_STRING_ALLOCA, the resulting text is stored in a stack-allocated buffer, which is automatically freed on returning from the function. However, the sink types MALLOC and C_STRING_MALLOC return xmalloc()ed memory. The caller is responsible for freeing this memory using xfree().
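
The ownership rule for the malloc-style sinks can be illustrated with a standalone sketch. This is an assumption-laden model, not SXEmacs code: demo_malloc_sink is a hypothetical function, it copies bytes instead of converting them, and plain malloc() and free() stand in for xmalloc() and xfree().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a MALLOC-style sink.  Copies src_len bytes
   into freshly allocated memory and hands ownership to the caller
   through *out_ptr and *out_len.  Returns 0 on success, -1 if the
   allocation fails. */
int
demo_malloc_sink (const char *src, size_t src_len,
                  char **out_ptr, size_t *out_len)
{
  char *copy;

  assert (src != NULL && out_ptr != NULL && out_len != NULL);

  /* Allocate at least one byte so a zero-length result is still
     a valid pointer that can be freed. */
  copy = malloc (src_len ? src_len : 1);
  if (copy == NULL)
    return -1;

  memcpy (copy, src, src_len);
  *out_ptr = copy;
  *out_len = src_len;
  return 0;
}
```

A caller of this sketch must eventually pass the returned pointer to free(), just as a caller of the real macros must pass a MALLOC or C_STRING_MALLOC result to xfree(). A stack-allocated result needs no such call, but it must not be used after the enclosing function has returned.
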

Note that it doesn’t make sense for LISP_STRING to be a source for TO_INTERNAL_FORMAT or a sink for TO_EXTERNAL_FORMAT. You’ll get an assertion failure if you try.

