2021/10/10: More compact formatting (for JSON)

When operating on code or structured data, one thinks in terms of the abstract syntax tree rather than the concrete serialization. So it makes sense to store a canonically formatted, pretty-printed version. Nevertheless, diffs are typically taken on the syntactic representation and are line-based. So we want some form of "continuity": similar abstract syntax trees should be pretty-printed in a similar way. More precisely, if two abstract syntax trees differ only in a subtree, the pretty-printed versions should differ only in the lines containing the syntactic representation of that subtree. We might accept a change to the line before or after that block as well (depending on whether we use leading or trailing commas), but larger unrelated parts should not differ.
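To make the continuity requirement concrete, here is a minimal Python sketch that lists the lines a line-based diff would report as changed between two serializations. The helper name changed_lines is made up for this illustration, and the standard json.dumps(..., indent=2) merely stands in for some canonical formatter.

```python
import difflib
import json


def changed_lines(old_text, new_text):
    """Return the lines of new_text that a line-based diff reports as added or changed."""
    old, new = old_text.splitlines(), new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(new[j1:j2])
    return changed


# Two trees that differ only in one leaf of a nested object.
before = {"type": "BAR", "long": {"a": "A", "b": "B"}}
after = {"type": "BAR", "long": {"a": "A", "b": "changed"}}

for line in changed_lines(json.dumps(before, indent=2), json.dumps(after, indent=2)):
    print(line)
```

With a formatter that has the continuity property, only the line holding the modified leaf shows up; a formatter whose indentation depends on sibling lengths would typically report more.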

Breaking a line after every structural element and using an indentation that depends only on the depth in the abstract syntax tree works well for JSON. However, it results in many short lines, with each value spread out over many of them. A more compact representation would improve readability. Using the example from that old blog entry on JSON, we would like to have it formatted something like the following.

{ "some entry":
  { "type": "FOO"
  , "short list": []
  , "longer list": ["first", "second", "third"]
  , "complicated dict":
    { "first key": "first value"
    , "second key": "second value"
    , "structued entry": {"type": "complex", "real": 0.0, "imag": 1.0}
    }
  }
, "another entry": ["this", "is", "just", "a", "long", "list"]
, "yet antoher entry":
  {"type": "BAR", "short": {}, "long": {"a": "A", "b": "B", "c": "C"}}
}

We just have to be careful to fall back to canonical formatting from the outside in when an entry (like the "long" object) grows, or at least make sure to always use an indentation that depends only on the position in the abstract syntax tree.

{ "some entry":
  { "type": "FOO"
  , "short list": []
  , "longer list": ["first", "second", "third"]
  , "complicated dict":
    { "first key": "first value"
    , "second key": "second value"
    , "structued entry": {"type": "complex", "real": 0.0, "imag": 1.0}
    }
  }
, "another entry": ["this", "is", "just", "a", "long", "list"]
, "yet antoher entry":
  { "type": "BAR"
  , "short": {}
  , "long": {"a": "A", "b": "B", "c": "C", "d": "Now too long!"}
  }
}

Breaking on the inside, we might get a shorter representation, but the indentation is not the canonical one.

{ "some entry":
  { "type": "FOO"
  , "short list": []
  , "longer list": ["first", "second", "third"]
  , "complicated dict":
    { "first key": "first value"
    , "second key": "second value"
    , "structued entry": {"type": "complex", "real": 0.0, "imag": 1.0}
    }
  }
, "another entry": ["this", "is", "just", "a", "long", "list"]
, "yet antoher entry":
  {"type": "BAR", "short": {}, "long": { "a": "A", "b": "B", "c": "C"
                                       , "d": "Now too long!"}}
}

The problem with such formatting is that the indentation is now detached from the position in the abstract syntax tree. In fact, it depends on the length of a sibling entry, like the string "BAR". Admittedly, in that example, it affects only the following line, but we're already in the danger zone, and adding another line like that would overdo it.

A simple implementation of compact JSON formatting (while respecting continuity) is to check, whenever an entry has to be written, whether it fits into the rest of the line; if so, we can put it there in condensed formatting, otherwise we continue with the canonical pretty printing.
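The check described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the script mentioned in this entry: the function name layout, its parameters, the 72-column width, and the treatment of nested lists are all choices made for this sketch.

```python
import json


def layout(value, column, width):
    """Return the lines for value, relative to the column where it starts.

    If the condensed one-line form fits between column and width it is used;
    otherwise the value is broken canonically, with an indentation that
    depends only on the depth in the tree.
    """
    flat = json.dumps(value)
    # Scalars and empty containers cannot be broken any further.
    if column + len(flat) <= width or not isinstance(value, (dict, list)) or not value:
        return [flat]
    is_dict = isinstance(value, dict)
    entries = list(value.items()) if is_dict else [(None, v) for v in value]
    lines = []
    for n, (key, item) in enumerate(entries):
        lead = ("{ " if is_dict else "[ ") if n == 0 else ", "
        if is_dict:
            lead += json.dumps(key) + ":"
            flat_item = json.dumps(item)
            if column + len(lead) + 1 + len(flat_item) <= width:
                # The entry fits into the rest of the line: condensed form.
                lines.append(lead + " " + flat_item)
            else:
                # Otherwise the value moves to its own line(s), one level deeper.
                lines.append(lead)
                lines.extend("  " + l for l in layout(item, column + 2, width))
        else:
            child = layout(item, column + 2, width)
            lines.append(lead + child[0])
            lines.extend("  " + l for l in child[1:])
    lines.append("}" if is_dict else "]")
    return lines


# Usage: the width of 72 columns is an assumption of this sketch.
example = {"type": "BAR", "short": {}, "long": {"a": "A", "b": "B", "c": "C"}}
print("\n".join(layout(example, 0, 72)))
print("\n".join(layout(example, 0, 40)))
```

Because the decision at each entry looks only at the entry itself and at the column given by its depth, growing one entry can expand that entry and its ancestors, but it never shifts the indentation of unrelated siblings.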

Update (2022-10-29): The script is available under the terms of a BSD-style license. Enjoy!