When writing emails, I usually don't use umlauts, even when I write the text in German. It seems that even today (with MIME encoding, Unicode, and all that), plain 7-bit ASCII text still works best. Instead of umlauts, I used to use the German TeX-style transcription ("a, "o, ...), but nowadays the people I write emails to seem to be happier with the "crossword-puzzle spelling" (ae, oe, ...). Of course, replacing umlauts by the corresponding two letters is not a 1-1 mapping on arbitrary strings; humans, however, are still capable of reading such a text, so for emails it usually is not a problem. But every now and then, the text of an email is to be used for something else (if only for a joint answer written together with people who like to see umlauts in German emails), so the umlauts have to be reconstructed.
Simply replacing every occurrence of ae, oe, ... by the corresponding umlaut does not work, as those letter sequences also occur naturally in German words. And I like to emphasise that my last name does not contain any umlaut. So, for an automatic reconstruction, we have to do something slightly closer to what a human does. A human reading a German text knows the words of the German language and can therefore easily guess which word was meant; it basically never happens that both the word with an umlaut and the word with the replacement sequence exist. As good word lists for German exist, we can do essentially the same: enumerate all possible preimages of the string encountered and take the one we find in the dictionary.
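Before the full script, the idea can be sketched in a few lines. The toy DICTIONARY and the function names here are illustrative stand-ins invented for this example; the actual script below reads /usr/share/dict/ngerman and also handles capitalization:

```python
# Illustrative sketch of the preimage idea; the tiny DICTIONARY is a
# stand-in for a real German word list.
DICTIONARY = {"Häuser", "Mauer", "Goethe"}

REPLACEMENTS = {"ae": "ä", "oe": "ö", "ue": "ü",
                "Ae": "Ä", "Oe": "Ö", "Ue": "Ü", "ss": "ß"}

def preimages(word):
    """Yield every string that transcribes to `word`: at each digraph
    occurrence we may keep the digraph or replace it by its umlaut."""
    # Find the leftmost digraph occurrence.
    best = None
    for digraph in REPLACEMENTS:
        i = word.find(digraph)
        if i >= 0 and (best is None or i < best[0]):
            best = (i, digraph)
    if best is None:
        yield word
        return
    i, digraph = best
    for tail in preimages(word[i + len(digraph):]):
        yield word[:i] + digraph + tail                 # keep the digraph
        yield word[:i] + REPLACEMENTS[digraph] + tail   # use the umlaut

def reconstruct(word):
    """Return the preimage found in the dictionary, or the word
    unchanged if nothing matches."""
    for w in preimages(word):
        if w in DICTIONARY:
            return w
    return word

print(reconstruct("Haeuser"))  # → Häuser
print(reconstruct("Mauer"))    # → Mauer ("ue" occurs naturally here)
```

Note how "Mauer" survives untouched: it contains a natural "ue", but the word itself is in the dictionary, so the digraph variant wins over the invented "Maür".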
The rest is bookkeeping.
#!/usr/bin/env python3
import os
import re
import sys

UMLAUTS = {
    'Ae': '\u00c4',
    'ae': '\u00e4',
    'Oe': '\u00d6',
    'oe': '\u00f6',
    'Ue': '\u00dc',
    'ue': '\u00fc',
    'ss': '\u00df',
}

# Global variable to trace the unknown words with umlaut candidates
# we met during the rewriting of the text. Those need to be looked at
# by hand later.
UNKNOWN_CRITICAL_WORDS = set()


def get_words(word_file="/usr/share/dict/ngerman", extra_file=None):
    """Obtain a set approximating the words used in the German
    language in correct spelling."""
    if not extra_file:
        extra_file = os.path.join(os.environ['HOME'], '.ispell_ngerman')
    with open(word_file) as f:
        words = set(f.read().splitlines())
    if os.path.exists(extra_file):
        with open(extra_file) as f:
            words = words.union(set(f.read().splitlines()))
    return words


def candidates(word):
    def split(word):
        # Find the leftmost occurrence of any replacement digraph.
        prefix, candidate, postfix = word, None, ""
        for k in UMLAUTS.keys():
            try:
                i = word.index(k)
                if i < len(prefix):
                    prefix, candidate, postfix = word[:i], k, word[i+len(k):]
            except ValueError:
                pass
        return prefix, candidate, postfix

    prefix, candidate, postfix = split(word)
    if not candidate:
        return [word]
    postfix_candidates = candidates(postfix)
    return ([prefix + candidate + w for w in postfix_candidates]
            + [prefix + UMLAUTS[candidate] + w for w in postfix_candidates])


def uncapitalize(word):
    if not word:
        return word
    return word[0].lower() + word[1:]


def umlautify_word(word, *, words):
    global UNKNOWN_CRITICAL_WORDS
    if word in words:
        return word
    possible_spellings = candidates(word)
    if len(possible_spellings) == 1:
        return word
    for w in possible_spellings:
        if w in words or uncapitalize(w) in words:
            return w
    UNKNOWN_CRITICAL_WORDS.add(word)
    return word


def consider_line(line):
    line_stripped = line.strip()
    if len(line_stripped) == 0:
        return False
    if line_stripped[0] in ['>', '|']:
        return False
    return True


def umlautify_line(line, *, words):
    if not consider_line(line):
        return line
    new_line = ""
    for w in re.split(r'(\W+)', line):
        new_line += umlautify_word(w, words=words)
    return new_line


if __name__ == "__main__":
    words = get_words()
    for line in sys.stdin:
        print(umlautify_line(line.rstrip('\n'), words=words))
    if UNKNOWN_CRITICAL_WORDS:
        print("Words to be checked by hand: %s" % (UNKNOWN_CRITICAL_WORDS,),
              file=sys.stderr)