2020/11/25: Guessing Umlauts

When writing emails, I usually don't use umlauts, even if I write the text in German. Even today (with MIME encoding, Unicode, and all that), plain 7-bit ASCII text seems to work best. Instead of umlauts, I used to use the German TeX-style transcription (using "a, "o, ...), but nowadays the people I write emails to seem to be happier with the "crossword-puzzle spelling" (using ae, oe, ...). Of course, replacing umlauts by the corresponding two letters is not a one-to-one mapping on arbitrary strings; however, humans are still capable of reading the text, so for emails this is usually not a problem. But every now and then, the text of an email is to be used for something else (if only for a joint reply together with people who like to see umlauts in German emails), so the umlauts have to be reconstructed.

Simply replacing every occurrence of ae, oe, ... by the corresponding umlaut does not work, as those letter sequences also occur naturally in German words. And I would like to emphasise that my last name does not contain any umlaut. So, for an automatic reconstruction, we have to do something slightly closer to what a human does. A human reading a German text knows the words of the German language and can therefore easily guess which word is meant. It basically never happens that both the word with an umlaut and the word with the replacement sequence exist. As good word lists for German exist, we can do essentially the same: enumerate all possible preimages of the string encountered and take the one we find in the dictionary.
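
The enumeration of preimages can be sketched in a few lines. The following is only a minimal illustration of the idea and makes some assumptions: the names REPLACEMENTS, PATTERN, preimages, and toy_dictionary are made up for this example, and the three-word dictionary merely stands in for a real word list.

```python
import itertools
import re

# Replacement sequences and the letters they may stand for.
REPLACEMENTS = {"Ae": "Ä", "ae": "ä", "Oe": "Ö", "oe": "ö",
                "Ue": "Ü", "ue": "ü", "ss": "ß"}
# The capturing group makes re.split keep the matched sequences.
PATTERN = re.compile("(" + "|".join(REPLACEMENTS) + ")")

def preimages(word):
    """Return every spelling that the transcription could have mapped to word."""
    parts = PATTERN.split(word)  # odd positions hold the matched sequences
    options = [
        [part, REPLACEMENTS[part]] if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(choice) for choice in itertools.product(*options)]

# Hypothetical stand-in for a real German word list.
toy_dictionary = {"für", "neue", "Straße"}

for word in ["fuer", "neue", "Strasse"]:
    matches = [p for p in preimages(word) if p in toy_dictionary]
    print(word, "->", matches)
# fuer -> ['für']
# neue -> ['neue']
# Strasse -> ['Straße']
```

A word with n replacement sequences has 2^n preimages, which is harmless for words of realistic length.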

The rest is bookkeeping.

We don't have to worry about overlaps, as an sz (ß) never follows an s; we can just greedily consider the first replacement opportunity. (The opposite, an s following an sz, can happen, e.g., in "Flossschifffahrt".)
A word can occur in the text in capitalised form, e.g., at the beginning of a sentence, or if title case is used; so we also accept a candidate whose uncapitalised form is in the dictionary.
If the word is in the dictionary as is, we keep it; should we really encounter a word that exists both with umlaut and with replacement sequence, a missing umlaut is less confusing than a wrong one.
We only want to look at words and keep all remaining characters, including repeated white space used for indenting.
We only change the top-level part of the text, not block-quoted passages (lines starting with > or |). Who knows what is quoted; it could still be one of my German emails.
For unknown words with more than one possible preimage, and only for those, we want to inform the user. Maybe the dictionary is incomplete.
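
Two of these points, keeping all non-word characters and skipping quoted lines, can be illustrated in isolation. This is a sketch under assumptions: rewrite_words is a name made up for this example, and str.upper is only a visible stand-in for the actual per-word rewriting.

```python
import re

def rewrite_words(line, rewrite):
    """Apply rewrite to each word of line and leave everything else untouched."""
    # Quoted lines are returned unchanged.
    if line.strip().startswith((">", "|")):
        return line
    # Because of the capturing group, re.split also returns the separators:
    # words sit at even positions, runs of non-word characters at odd ones.
    pieces = re.split(r"(\W+)", line)
    return "".join(rewrite(p) if i % 2 == 0 else p
                   for i, p in enumerate(pieces))

print(rewrite_words("  Hallo,   Welt!", str.upper))
print(rewrite_words("> zitierter Text", str.upper))
```

Joining the pieces returned by re.split reproduces the original line exactly, so indentation and punctuation survive unchanged.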

#!/usr/bin/env python3

import os
import re
import sys

UMLAUTS = { 'Ae' : '\u00c4',
            'ae' : '\u00e4',
            'Oe' : '\u00d6',
            'oe' : '\u00f6',
            'Ue' : '\u00dc',
            'ue' : '\u00fc',
            'ss' : '\u00df',
}

# Global variable to trace the unknown words with umlaut candidates
# we met during the rewriting of the text. Those need to be looked at
# by hand later.
UNKNOWN_CRITICAL_WORDS = set()

def get_words(word_file="/usr/share/dict/ngerman",
              extra_file=None):
  """Obtain a set approximating the words used in the German language
  in correct spelling."""
  if not extra_file:
    extra_file = os.path.join(os.environ['HOME'], '.ispell_ngerman')
  with open(word_file) as f:
    words = set(f.read().splitlines())
  if os.path.exists(extra_file):
    with open(extra_file) as f:
      words = words.union(set(f.read().splitlines()))
  return words

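def candidates(word):
  """Return all possible preimages of word, the unchanged spelling first."""
  def split(word):
    # Split the word at its leftmost replacement sequence, if there is one.
    prefix, candidate, postfix = word, None, ""
    for k in UMLAUTS.keys():
      try:
        i = word.index(k)
        if i < len(prefix):
          prefix, candidate, postfix = word[:i], k, word[i+len(k):]
      except ValueError:
        pass
    return prefix, candidate, postfix

  prefix, candidate, postfix = split(word)
  if not candidate:
    return [word]

  # Either keep the replacement sequence or substitute the umlaut; in both
  # cases, continue recursively with the rest of the word.
  postfix_candidates = candidates(postfix)
  return ([prefix + candidate + w for w in postfix_candidates]
          + [prefix + UMLAUTS[candidate] + w for w in postfix_candidates])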

def uncapitalize(word):
  if not word:
    return word
  return word[0].lower() + word[1:]

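def umlautify_word(word, *, words):
  """Return the spelling of word found in the dictionary; if there is none,
  return word unchanged and remember it if it has umlaut candidates."""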
  if word in words:
    return word
  possible_spellings = candidates(word)
  if len(possible_spellings) == 1:
    return word
  for w in possible_spellings:
    if w in words or uncapitalize(w) in words:
      return w
  UNKNOWN_CRITICAL_WORDS.add(word)
  return word

def consider_line(line):
  """Tell whether line is top-level text, i.e., neither empty nor quoted."""
  line_stripped = line.strip()
  if not line_stripped:
    return False
  return line_stripped[0] not in ('>', '|')

def umlautify_line(line, *, words):
  if not consider_line(line):
    return line
  new_line = ""
  for w in re.split(r'(\W+)', line):
    new_line += umlautify_word(w, words=words)
  return new_line

if __name__ == "__main__":
  words = get_words()
  for line in sys.stdin:
    print(umlautify_line(line.rstrip('\n'), words=words))
  if UNKNOWN_CRITICAL_WORDS:
    print("Words to be checked by hand: %s" % (UNKNOWN_CRITICAL_WORDS,),
          file=sys.stderr)
