B Regular expressions

As TeXworks is built on Qt4, the regular expressions it supports (often called regexps) are a subset of those found in Qt4. See the Qt4 site¹ for more information. Further information about regexps can be found on the net² or in books. Note, however, that not all systems (programming languages, editors, …) use the same set of instructions; unfortunately, there is no “standard set”.

B.1 Introduction

When searching and replacing, one has to define the text to be found. This can be the text itself (e.g., “Abracadabra”), but it is often necessary to define the strings in a more generic and powerful way, to avoid repeating the same operation many times with only small changes from one time to the next. Suppose, for example, that one wants to replace sequences of the letter a by sequences of the letter o, but only sequences of 3, 4, 5, 6 or 7 a; this would require repeating (and slightly adjusting) the find and replace procedure 5 times. Another example: replacing all vowels by § would again take 5 replace operations. This is where regular expressions come in!

A simple character (a or 9) represents itself. But a set of characters can also be defined: [aeiou] will match any vowel, and [abcdef] the letters a, b, c, d, e, and f; this last set can be shortened to [a-f], using “-” between the two ends of the range. Ranges can even be combined: [a-zA-Z0-9] will match any unaccented letter and any digit.

To define a complementary set³, one uses “^”: the caret negates the character set if it occurs at the beginning, i.e., immediately after the opening square bracket. For example, [^abc] matches anything except a, b, c.
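The set notation above can be tried out in a few lines of Python. This is only a sketch: Python's `re` module is not the Qt4 engine used by TeXworks, although it treats these simple sets the same way, and the sample strings are made up for illustration.

```python
import re

text = "Abracadabra 2024"

# [aeiou] matches any single lowercase vowel.
vowels = re.findall(r"[aeiou]", text)

# [a-f] is shorthand for [abcdef].
a_to_f = re.findall(r"[a-f]", text)

# [^abc] matches any single character except a, b, or c.
not_abc = re.findall(r"[^abc]", "abcde")

# Replacing all vowels by § takes one operation instead of five.
replaced = re.sub(r"[aeiou]", "§", text)

print(vowels)
print(a_to_f)
print(not_abc)
print(replaced)
```

Note that the capital A is left alone, because the set [aeiou] lists lowercase letters only.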

B.2 Codes to represent special sets

When using regexps, one very often has to create a search expression which represents other strings in a generic way. If you are looking for a string that matches email addresses, for example, the letters and symbols will vary; still, you could search for any string which corresponds to the structure of an email address (<text>@<text>.<text>, roughly). To facilitate this, there are abbreviations to represent letters, digits, symbols, …

These codes replace and facilitate the definition of sets; for example, instead of manually defining the set of digits [0-9], one can use “\d”. The following table lists the replacement codes.⁴

| Element | Meaning |
|---|---|
| c | Any character represents itself unless it has a special regexp meaning. Thus c matches the character c. |
| \c | A special character that follows a backslash matches the character itself except where mentioned below. For example, if you wished to match a literal caret at the beginning of a string you would write “\^”. |
| \n | This matches the ASCII line feed character (LF, Unix newline, used in TeXworks). |
| \r | This matches the ASCII carriage return character (CR). |
| \t | This matches the ASCII horizontal tab character (HT). |
| \v | This matches the ASCII vertical tab character (VT; almost never used). |
| \xhhhh | This matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo (i.e., zero-ooo) matches the ASCII/Latin-1 character corresponding to the octal number ooo (between 0 and 0377). |
| . (dot) | This matches any character (including newline). So if you want to match the dot character itself, you have to escape it with “\.”. |
| \d | This matches a digit. |
| \D | This matches a non-digit. |
| \s | This matches a white space. |
| \S | This matches a non-white space. |
| \w | This matches a word character (a letter, a digit, or “_”). |
| \W | This matches a non-word character. |
| \1, \2, … | The n-th back-reference, e.g. \1, \2, etc.; used in the replacement string with capturing patterns, i.e., expressions grouped by parentheses (see below). |

Using these abbreviations is better than describing the set, because the abbreviations remain valid in different alphabets.

Note that the end of a line is often treated as white space. In TeXworks, the end of a line is represented by “\n”.
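A few of these codes can be illustrated with Python's `re` module. This is a sketch for experimentation, not the TeXworks engine: the sample strings are invented, and some details differ between engines (in Python, for instance, the dot does not match a newline unless a special flag is set, so the examples avoid that case).

```python
import re

line = "Page 42,\tsection_7\n"

# \d matches a digit, so this collects every digit in the line.
digits = re.findall(r"\d", line)

# \s matches white space: here a space, a tab and the end of line.
spaces = re.findall(r"\s", line)

# \W matches anything that is not a word character.
non_word = re.findall(r"\W", line)

# \w treats the underscore as a word character.
word_chars = re.findall(r"\w", "a_1-")

# An unescaped dot matches any character; "\." matches a literal dot only.
any_char = re.sub(r".", "!", "a.b")
literal_dot = re.sub(r"\.", "!", "a.b")

print(digits, spaces, non_word, word_chars, any_char, literal_dot)
```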

B.3 Repetition

One rarely works only on single letters, digits, or symbols; most of the time, these are repeated (e.g., a number is a sequence of digits and symbols in the right order).

To specify the number of repetitions, one uses a so-called “quantifier”: a{1,1} means exactly one a, and a{3,7} means at least 3 and at most 7 a; {1,1} is redundant, of course, so a{1,1} is the same as a.

This can be combined with the set notation: [0-9]{1,2} will correspond to at least one digit and at most two, i.e., the integer numbers between 0 and 99. But this will match any group of 1 or 2 digits within an arbitrary string (which may have a lot of text before and after the integer). If we want a match only when the whole string consists entirely of 1 or 2 digits (without any other characters preceding or following them), we can rewrite the regular expression as ^[0-9]{1,2}$. Here, the ^ specifies that any match must start at the first character of the string, while the $ says that any match must end at the last character of the string, so the string can only consist of one or two digits (^ and $ are so-called “assertions”; more on them later).
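The difference between the anchored and the unanchored expression can be checked with a short Python sketch. Python's `re` module stands in for the Qt4 engine here, and the sample strings are assumptions chosen for illustration.

```python
import re

unanchored = re.compile(r"[0-9]{1,2}")
anchored = re.compile(r"^[0-9]{1,2}$")

samples = ["7", "42", "123", "room 42"]

# True if one or two digits occur anywhere in the string.
found_anywhere = {s: bool(unanchored.search(s)) for s in samples}

# True only if the whole string consists of one or two digits.
whole_string = {s: bool(anchored.search(s)) for s in samples}

for s in samples:
    print(repr(s), found_anywhere[s], whole_string[s])
```

The unanchored expression finds digits in all four samples, while the anchored one accepts only “7” and “42”.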

Here is a table of quantifiers.⁵ E represents an arbitrary expression (letter, abbreviation, set).

| Quantifier | Meaning |
|---|---|
| E{n,m} | Matches at least n occurrences of the expression and at most m occurrences of the expression. |
| E{n} | Matches exactly n occurrences of the expression. This is the same as E{n,n} or as repeating the expression n times. |
| E{n,} | Matches at least n occurrences of the expression. |
| E{,m} | Matches at most m occurrences of the expression. |
| E? | Matches zero or one occurrence of E. This quantifier effectively means the expression is optional (it may be present, but doesn’t have to). It is the same as E{0,1}. |
| E+ | Matches one or more occurrences of E. This is the same as E{1,}. |
| E* | Matches zero or more occurrences of E. This is the same as E{0,}. Beware, the * quantifier is often used by mistake instead of the + quantifier. Since it matches zero or more occurrences, it will match even if the expression is not present in the string. |
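The quantifiers can be explored with the following Python sketch. It uses Python's `re` module rather than the Qt4 engine behind TeXworks, and the sample strings are made up; the last two lines illustrate the warning about * matching even when nothing is there.

```python
import re

# a{3,7} matches runs of 3 to 7 a's; shorter runs are ignored.
runs = re.findall(r"a{3,7}", "a aa aaa aaaaa aaaaaaa")

# u? makes the letter u optional.
optional = re.findall(r"colou?r", "color colour colouur")

# \d+ needs at least one digit.
numbers = re.findall(r"\d+", "a1b22c")

# \d* succeeds on a string without digits by matching nothing at all,
# whereas \d+ fails on the same string.
star = re.search(r"\d*", "abc")
plus = re.search(r"\d+", "abc")

print(runs, optional, numbers)
print(repr(star.group(0)), plus)
```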

B.4 Alternatives and assertions

When searching, it is often necessary to search for alternatives, e.g., apple, pear, or cherry, but not pineapple. To separate the alternatives, one uses |: apple|pear|cherry. But this does not prevent pineapple from being found, so we have to specify that apple should be standalone, a whole word (as it is often called in search dialog boxes).

To specify that a string should be considered standalone, we specify that it is surrounded by word separators/boundaries (beginning/end of a sentence, space), like \bapple\b. For our alternatives example, we group them with parentheses and add the boundaries: \b(apple|pear|cherry)\b. Apart from \b, we have already seen ^ and $, which mark the boundaries of the whole string.
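The effect of the word boundaries, and of the back-reference \1 from the table of codes, can be sketched in Python. The `re` module is used here in place of the Qt4 engine, and the sample sentence is an assumption.

```python
import re

text = "pineapple, apple pie, pear"

# Without boundaries, the "apple" inside "pineapple" is found as well.
loose = re.findall(r"apple|pear|cherry", text)

# With \b on both sides, only standalone words are found.
whole_words = re.findall(r"\b(apple|pear|cherry)\b", text)

# The parentheses also capture the matched word, so the replacement
# string can refer to it as \1.
bracketed = re.sub(r"\b(apple|pear|cherry)\b", r"[\1]", text)

print(loose)
print(whole_words)
print(bracketed)
```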

Here is a table of the “assertions”, which do not correspond to actual characters and will never be part of the result of a search.⁶

| Assertion | Meaning |
|---|---|
| ^ | The caret signifies the beginning of the string. If you wish to match a literal ^, you must escape it by writing \^. |
| $ | The dollar signifies the end of the string. If you wish to match a literal $, you must escape it by writing \$. |
| \b | A word boundary. |
| \B | A non-word boundary. This assertion is true wherever \b is false. |
| (?=E) | Positive lookahead. This assertion is true if the expression E matches at this point. |
| (?!E) | Negative lookahead. This assertion is true if the expression E does not match at this point. |

Notice the different meanings of ^ as an assertion and as a negation inside a character set!
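The two lookahead assertions can be illustrated with a small Python sketch. Python's `re` module is assumed to behave like the Qt4 engine for these two constructs, and the price list is invented. Note that the text tested by the lookahead never becomes part of the result.

```python
import re

prices = "10 EUR, 20 USD, 30 EUR"

# Positive lookahead: numbers that are followed by " EUR".
# The " EUR" itself is checked but not included in the match.
euros = re.findall(r"\d+(?= EUR)", prices)

# Negative lookahead: whole numbers that are NOT followed by " EUR".
# The \b assertions keep the match from stopping in the middle of a number.
others = re.findall(r"\b\d+\b(?! EUR)", prices)

print(euros)
print(others)
```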

B.5 Final notes

Using regexps is very powerful, but also quite dangerous; you could change your text in places you did not intend, and sometimes it is not possible to revert entirely to the previous state. If you notice the error immediately, you can try Ctrl+Z.

Showing how to exploit the full power of regexps would require much more than this extremely short summary; in fact, it would require a full manual of its own.

Also note that there are some limits in the implementation of regexps in TeXworks; in particular, the assertions (^ and $) only consider the whole file, and there are no look-behind assertions.
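The whole-file behaviour of ^ and $ can be mimicked in Python, whose `re` module (used without any flags) also anchors only at the very start and end of the text. This is an analogy under that assumption, not a test of TeXworks itself, and the two-line sample text is made up.

```python
import re

text = "first line\nsecond line"

# ^ anchors only at the very beginning of the text ...
starts = re.findall(r"^\w+", text)

# ... and $ only at the very end, not at each line end.
ends = re.findall(r"\w+$", text)

# A word at the start of a later line can still be reached through the
# explicit newline character \n; the parentheses capture the word only.
after_newline = re.findall(r"\n(\w+)", text)

print(starts, ends, after_newline)
```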

Finally, do not forget to “tick” the regexp option when using regexps in the Find and Replace dialogs, and to un-tick it when not using them.