2020/02/01: The latin-1 hack

In the old days, everything was bytes. The contents of files were bytes, execve(2) wanted bytes, processes wrote sequences of bytes to stdout, and all parsing and string formatting functions operated on sequences of bytes. Then came Unicode, and in Python most of the convenient string operations moved to str, i.e., sequences of Unicode code points. Nevertheless, inter-process communication, files, and system calls still take bytes. So we have to think about encodings.

Most encodings, like utf-8, embed Unicode strings into bytes, i.e., they treat bytes as the larger data structure by providing a map from Unicode strings to byte strings that is 1-1, but not onto. That kind of encoding is useful if, semantically, I have a sequence of Unicode symbols (like a text) and just need a way to store it on disk. But if I start from a sequence of bytes, e.g., because I have to parse the stdout of some process, such an encoding does not work: as it is not onto, some byte strings do not decode to anything. I would need an embedding in the other direction. I stumbled over that problem when porting my old, but still actively used, scripts for distributed use of rcs to Python 3.

Fortunately, there is such an embedding in the other direction, i.e., a 1-1 map from byte strings to Unicode strings. It is the latin-1 encoding, read in the decoding direction: it simply uses the Unicode code points 0–255 to represent the possible values of a byte. The results are not necessarily meaningful Unicode strings, but merely byte strings in disguise. On the other hand, I was porting a specification that only talked about bytes anyway, and on those disguised byte strings the parsing and formatting functions work the same way as the old ones used to work on byte strings. For me, the easiest way of porting was to keep this disguise as long as possible and to handle inter-process communication with functions like the following.


from subprocess import check_output

def str_check_output(cmdline):
    return check_output([x.encode("latin-1") for x in cmdline]).decode("latin-1")
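To make the "1-1 in the other direction" claim concrete, here is a minimal sketch of the round trip. The helper names disguise, undisguise, and utf8_can_decode, as well as the sample output in raw, are made up for illustration and are not part of the ported scripts.

```python
def disguise(data: bytes) -> str:
    """Map each byte to the code point with the same value (latin-1 decoding)."""
    return data.decode("latin-1")


def undisguise(text: str) -> bytes:
    """Inverse of disguise; raises UnicodeEncodeError for code points above 255."""
    return text.encode("latin-1")


def utf8_can_decode(data: bytes) -> bool:
    """Report whether data happens to be valid utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# All 256 byte values survive the round trip unchanged.
all_bytes = bytes(range(256))
assert undisguise(disguise(all_bytes)) == all_bytes

# The same bytes are not the utf-8 encoding of any str.
assert not utf8_can_decode(all_bytes)

# Hypothetical process output containing non-ASCII bytes, parsed as str.
raw = b"head\t1.\xe9\nbranch\t\xff\n"
fields = [line.split("\t") for line in disguise(raw).split("\n") if line]
rebuilt = "".join("\t".join(f) + "\n" for f in fields)
assert undisguise(rebuilt) == raw
```

One caveat: str.splitlines() and str.split() without arguments also break on some code points in the range 128–255 (for instance "\x85" and "\xa0"), whereas the bytes versions only consider ASCII. That is why the sketch splits on explicit separators.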

The ported scripts rcsshort and rcscjoin are also online.