Regular Expressions

Given the closure properties for languages definable by finite automata, an efficient way to denote finite automata is by using the following expressions, called regular expressions.

Defintion and Searching for Regular Expressions

More precisely, we inductively define the set of regular expressions and for each regular expression e a language L(e) as follows.

  • every letter c is a regular expression and L(c) = { c }
  • for regular e and f expressions, e|f is a regular expression and L(e|f) = L(e) union L(f)
  • for e and f regular expressions, ef is a regular expression and L(ef) = L(e)L(f)
  • for e a regular expression, e^* is a regular expression and L(e^*) = L(e)^*
  • epsilon is a regular expression and L(epsilon) = { epsilon }
  • emptyset is a regular expression and L(emptyset) = {}

The star binds strongest, then juxtaposition, then union. To have other obtain other derivation trees, subexpressions can be parenthesised with \( and \).

A language is called regular, if it is L(e) for some regular expression e.

Note. The definition of regular expressions could be more uniquely readable, if everything was properly parenthesised. However, explicit parenthesis have an additional special meaning in ed with respect to binding, so let’s try to not use too many. Also, emptyset is is not present in the set of regular expressions available for ed, and regular expression epsilon is denoted by the empty string.

It follows immediately from the observations made about finite automata that for each regular expression there is a non-deterministic automaton of size linear in the expression that recognises the denoted language.

Exercise. What is the cost (time, memory) of searching for a regular expression? Note that your answer might depend on the implementation you chose (full determination of the NFA, bit-vector construction, etc); which implementation would you build into a text editor?

Extensions

Besides the official syntax just introduced, several extensions are also used in practise (and, in particular, in ed).

  • Character-classes define a set of characters and their language consists of all one-letter words consisting of one of those characters. The syntax is [character-class] where character-class is the list of allowed letters; to include a ] it must be mentioned in first position (why can this be recognised?). Ranges can be specified using -; a - sign as first or last character stands for itself.

    Besides that, there are the standard character classes [:alnum:], [:alpha:], [:blank:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].

  • Inverted character classes where the syntax is [^character-class] specify any one-letter word with the letter not belonging to the specified set.

  • Repetition, where the syntax is one of e\{n,m\}, e\{n,\}, e\{n\} and binding is as tight as possible. They denote the regular expression e repeated n to m times, at least n times, or precisely n times, respectively.

Exercise. Prove that each of the mentioned extensions can be considered abbreviations for the official regular expressions as defined. Would this also be the way you would implement it?

Completeness of the Expressive Power of Regular Expressions

While regular expressions can nicely be translated into finite automata, one question remains: are we missing anything out? In other words, can the language of every finite automaton be described by a regular expression? It turns out that this is indeed the case.

To see this, let A be a finite automaton, without loss of generality deterministic, without loss of generality the state set is Q = { 1, 2, ..., n } for some natural number n. We define, for p and q elements of Q, and i a natural number (i.e., a non-negative integer) the sets L(p, q, i) to be the set of those words w such that reading the word w, the automaton when starting in p will end up in state q and for all states r visited strictly in between (i.e., not counting the initial p and the final q) it holds r <= i.

Exercise. Finish the proof.

Exercise. Describe the algorithm implicit in this proof. What is its complexity and the complexity of the resulting finite expression?

Exercise. Prove that the regular languages are closed under taking complement. Explain why the extended regular expressions used in practise usually nevertheless do not contain an expression for negation.

Side remark: not all languages are regular

While not relevant for our purpose, it is worth noting that not all languages are regular. One obvious argument is counting: there are only countably many regular expressions, but the set of all languages is uncountable. We can also give a more concrete example.

Lemma For every regular language L there is a natural number N such that for every x in L with at least N symbols, there are words u, v, and w, such that x = uvw, the word v is not the empty word, and for every natural number n it holds u v^n w in L.

For the proof, let A be a deterministic finite automaton and N the number of states of N. For x in L with at least N letters consider its run through A. We note that at least one state is visited at least twice.

Exercise. Finish the proof.

Exercise. Show that { a^n b^n | n a natural number } is not regular.