2020/11/25: Guessing Umlauts

When writing emails, I usually don't use umlauts, even if I write the text in German. Even today (with MIME encoding, Unicode, and all that), plain 7-bit ASCII text still seems to work best. Instead of umlauts, I used to use the German TeX-style transcription (using "a, "o, ...), but nowadays the people I write emails to seem to be happier with the "crossword-puzzle spelling" (using ae, oe, ...). Of course, replacing umlauts by the corresponding two letters is not a one-to-one mapping on arbitrary strings; however, humans are still capable of reading the text, so for emails it is usually not a problem. But every now and then, the text of an email is to be used for something else (even if only for a joint answer together with people who like to see umlauts in German emails), so the umlauts have to be reconstructed.
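To make the "not one-to-one" remark concrete, here is a minimal sketch of the forward transcription. The names TRANSCRIPTION and to_ascii are hypothetical helpers introduced only for this illustration; they are not part of the script further down. The string "Baür" is deliberately not a German word, since the claim is about arbitrary strings.

```python
# Hypothetical helper for illustration only: the forward
# "crossword-puzzle" transcription from umlauts to plain ASCII.
TRANSCRIPTION = {
    '\u00c4': 'Ae', '\u00e4': 'ae',
    '\u00d6': 'Oe', '\u00f6': 'oe',
    '\u00dc': 'Ue', '\u00fc': 'ue',
    '\u00df': 'ss',
}

def to_ascii(text):
  """Replace every umlaut (and sharp s) by its two-letter transcription."""
  return ''.join(TRANSCRIPTION.get(ch, ch) for ch in text)

# Two different strings with the same image, so the mapping cannot be
# inverted mechanically: 'Ba\u00fcr' and 'Bauer' both become 'Bauer'.
same_image = to_ascii('Ba\u00fcr') == to_ascii('Bauer')
```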

Simply replacing every occurrence of ae, oe, ... by the corresponding umlaut does not work, as those letter sequences also occur naturally in German words. And I would like to emphasise that my last name does not contain any umlaut. So, for an automatic reconstruction, we have to do something slightly closer to what a human does. A human reading a German text knows the words of the German language and can therefore easily guess which word is written there. It basically never happens that both the word with an umlaut and the word with the replacement sequence exist. As good word lists for German exist, we can do essentially the same: enumerate all possible preimages of the string encountered and take the one we find in the dictionary.
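As a rough sketch of that idea, the snippet below enumerates candidate preimages and looks them up in a word set. It rests on several assumptions: TOY_WORDS is a tiny made-up stand-in for a real word list, the names preimages and guess are hypothetical, and the enumeration uses a regular expression with itertools.product rather than recursion. Two-letter sequences are matched left to right without overlap, so this does not cover every conceivable preimage.

```python
import itertools
import re

# Assumption: a toy word list standing in for a real German dictionary.
TOY_WORDS = {'Gr\u00fc\u00dfe', 'M\u00fcller', 'neue', 'Bauer'}

REPLACEMENTS = {'Ae': '\u00c4', 'ae': '\u00e4', 'Oe': '\u00d6',
                'oe': '\u00f6', 'Ue': '\u00dc', 'ue': '\u00fc',
                'ss': '\u00df'}
PATTERN = re.compile('(' + '|'.join(REPLACEMENTS) + ')')

def preimages(word):
  """Enumerate candidate spellings; the unchanged word comes first."""
  # With a capturing group, re.split keeps the matched two-letter
  # sequences at the odd indices of the result.
  parts = PATTERN.split(word)
  choices = [[p, REPLACEMENTS[p]] if i % 2 else [p]
             for i, p in enumerate(parts)]
  return [''.join(c) for c in itertools.product(*choices)]

def guess(word, words=TOY_WORDS):
  """Return the first candidate found in the word set, else the input."""
  for candidate in preimages(word):
    if candidate in words:
      return candidate
  return word
```

With this toy word list, guess('Gruesse') tries four spellings and settles on the one with both replacements, while guess('neue') and guess('Bauer') come back unchanged because the plain spelling is itself a known word.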

The rest is bookkeeping.

#!/usr/bin/env python3

import os
import re
import sys

UMLAUTS = { 'Ae' : '\u00c4',
            'ae' : '\u00e4',
            'Oe' : '\u00d6',
            'oe' : '\u00f6',
            'Ue' : '\u00dc',
            'ue' : '\u00fc',
            'ss' : '\u00df' }

# Global variable to trace the unknown words with umlaut candidates
# we met during the rewriting of the text. Those need to be looked at
# by hand later.
UNKNOWN_CRITICAL_WORDS = set()

def get_words(word_file="/usr/share/dict/ngerman",
              extra_file=None):
  """Obtain a set approximating the words used in the German language
  in correct spelling."""
  if not extra_file:
    extra_file = os.path.join(os.environ['HOME'], '.ispell_ngerman')
  with open(word_file) as f:
    words = set(f.read().splitlines())
  if os.path.exists(extra_file):
    with open(extra_file) as f:
      words = words.union(set(f.read().splitlines()))
  return words

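def candidates(word):
  """Enumerate the possible spellings of word, with each replacement
  sequence either kept as written or turned into its umlaut."""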
  def split(word):
    prefix, candidate, postfix = word, None, ""
    for k in UMLAUTS.keys():
      try:
        i = word.index(k)
        if i < len(prefix):
          prefix, candidate, postfix = word[:i], k, word[i+len(k):]
      except ValueError:
        pass
    return prefix, candidate, postfix

  prefix, candidate, postfix = split(word)
  if not candidate:
    return [word]

  postfix_candidates = candidates(postfix)
  return ([prefix + candidate + w for w in postfix_candidates]
          + [prefix + UMLAUTS[candidate] + w for w in postfix_candidates])

def uncapitalize(word):
  if not word:
    return word
  return word[0].lower() + word[1:]

def umlautify_word(word, *, words):
  if word in words:
    return word
  possible_spellings = candidates(word)
  if len(possible_spellings) == 1:
    return word
  for w in possible_spellings:
    if w in words or uncapitalize(w) in words:
      return w
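  # No candidate spelling is a known word; remember it for manual checking.
  UNKNOWN_CRITICAL_WORDS.add(word)
  return word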

def consider_line(line):
  line_stripped = line.strip()
  if len(line_stripped) == 0:
    return False
  if line_stripped[0] in ['>', '|']:
    return False
  return True

def umlautify_line(line, *, words):
  if not consider_line(line):
    return line
  new_line = ""
  for w in re.split(r'(\W+)', line):
    new_line += umlautify_word(w, words=words)
  return new_line

if __name__ == "__main__":
  words = get_words()
  for line in sys.stdin:
    print(umlautify_line(line.rstrip('\n'), words=words))
  print("Words to be checked by hand: %s" % (UNKNOWN_CRITICAL_WORDS,),
        file=sys.stderr)