

44.2.2 Complex Regexp Example

Here is a complicated regexp, used by SXEmacs to recognize the end of a sentence together with any whitespace that follows. It is the value of the variable sentence-end.

First, we show the regexp as a string in Lisp syntax to distinguish spaces from tab characters. The string constant begins and ends with a double-quote. ‘\"’ stands for a double-quote as part of the string, ‘\\’ for a backslash as part of the string, ‘\t’ for a tab and ‘\n’ for a newline.

"[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*"

In contrast, if you evaluate the variable sentence-end, you will see the following:

sentence-end
⇒
"[.?!][]\"')}]*\\($\\| $\\|  \\|  \\)[
]*"

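In this output, tab and newline appear as themselves, so on the page the tab cannot be told apart from a run of spaces. This is why the Lisp string syntax is shown first.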
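The difference between the two forms comes from the Lisp reader, which converts escape sequences when it reads a string constant. The following is a small sketch, assuming the standard functions length, string= and char-to-string, that can be evaluated to confirm what the escapes stand for:

```elisp
;; "\t" in Lisp source is read as a single tab character.
(length "\t")                          ; ⇒ 1

;; "\\(" is read as two characters: a backslash and an open paren.
;; The regexp matcher then sees `\(', which opens a group.
(length "\\(")                         ; ⇒ 2

;; The escaped form and a real tab character are the same string.
(string= "\t" (char-to-string ?\t))    ; ⇒ t
```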

This regular expression contains four parts in succession and can be deciphered as follows:

[.?!]

The first part of the pattern is a character set that matches any one of three characters: period, question mark, and exclamation mark. The match must begin with one of these three characters.

[]\"')}]*

The second part of the pattern matches any closing brackets, parentheses, braces, and quotation marks (single or double), zero or more of them, that may follow the period, question mark or exclamation mark. The ‘]’ comes first in the set so that it is taken literally rather than as the end of the set. The \" is Lisp syntax for a double-quote in a string. The ‘*’ at the end indicates that the immediately preceding regular expression (a character set, in this case) may be repeated zero or more times.
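As an illustration, the first two parts can be tried on their own with string-match. This is only a sketch: the sample string is made up, and the matched text is extracted with substring and the match data.

```elisp
;; Match the terminating punctuation plus any closing quotes or
;; brackets that follow it.  The sample text is hypothetical.
(let ((str "He shouted \"Stop!\") and left"))
  (when (string-match "[.?!][]\"')}]*" str)
    (substring str (match-beginning 0) (match-end 0))))
;; ⇒ "!\")"
```

The match begins at the exclamation mark and extends over the double-quote and the closing parenthesis that follow it.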

\\($\\| $\\|\t\\|  \\)

The third part of the pattern matches the whitespace that follows the end of a sentence: the end of a line (optionally preceded by a single space), or a tab, or two spaces. The double backslashes mark the parentheses and vertical bars as regular expression syntax; the parentheses delimit a group and the vertical bars separate alternatives. The dollar sign is used to match the end of a line.
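The alternatives in this group can be checked one at a time. The sketch below prefixes the group with the first part of the pattern and tries it on a few made-up strings; it relies on the fact that, when matching against a string, ‘$’ matches at the end of the string or just before a newline.

```elisp
;; Each result is the position of the period, or nil if the text
;; after the period does not count as a sentence break.
(let ((re "[.?!]\\($\\| $\\|\t\\|  \\)"))
  (mapcar (lambda (s) (string-match re s))
          '("One."          ; end of line
            "One. \nTwo"    ; one space, then end of line
            "One.\tTwo"     ; tab
            "One.  Two"     ; two spaces
            "One. Two")))   ; one space in mid-line: no match
;; ⇒ (3 3 3 3 nil)
```

The last case shows why a period followed by a single space in the middle of a line, as in an abbreviation, is not treated as the end of a sentence.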

[ \t\n]*

Finally, the last part of the pattern matches any additional whitespace beyond the minimum needed to end a sentence.
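Putting the four parts together, the whole regexp can be applied to a string to find where one sentence ends and the next begins. This is a sketch only: the variable name my-sentence-end and the sample text are hypothetical, and the regexp is copied from the string shown above rather than taken from the value of sentence-end.

```elisp
(let ((my-sentence-end                      ; hypothetical local name
       "[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*")
      (text "First one.  Second one."))
  (when (string-match my-sentence-end text)
    ;; Return where the match starts, where it ends, and the text
    ;; that follows the match.
    (list (match-beginning 0)
          (match-end 0)
          (substring text (match-end 0)))))
;; ⇒ (9 12 "Second one.")
```

The match covers the period and the two spaces after it, so the end of the match is the start of the next sentence.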

