2018/10/14: Line breaking, revisited

Two weeks ago, I got fed up with the simple line breaking algorithm I used till then for emails. (I use vi to write emails, so, in best UNIX tradition, line breaking is an external program, called by ":%!linebreak", or, at that time, more likely by "'a,'b!linebreak".) The main problem was not the algorithm on deciding what to put on the old and what on the new line, but the lack of context awareness (quoted text, indented text, itemize, etc). But when rewriting it anyway, I can as well do the breaking algorithm properly.

Globally optimising each paragraph

The standard algorithm for breaking text into lines is to have a cost that describes how far the line is away from the optimal line. Then we take that line breaking where the total cost is minimal. The way to compute this is to consider a graph, with each position where a line break could occur as node, and edges for all possible lines, weighted by the cost of that line (keeping in mind that the last line has a different cost measure as an underfull last line of a paragraph is not a problem). Then we compute the cheapest path from the beginning of the text to its end in a standard way. In fact, with this special graph structure, it is a simple dynamic programming, where the information we have to carry from word to word is the optimal breaking (with associated cost for the inner lines) for each possible last line.

As cost measure I use quadratic distance from the target number of characters (in the empty context I target for lines with 65 characters), disallowing (i.e., assigning infinite cost) overlong lines (more than 80 characters in the empty context) unless they consist of a single unbreakable unit (typically an extra long URL).

Context

For me, a context for a paragraph of text is given by a prefix for the first line, a prefix for the remaining lines, and a right indentation. The hard limit for the text part of a line is 80 characters minus the length of the respective prefix. The target length of the text part is 65 minus the length of the prefix and the right indentation.

The context is guessed by looking at the first line of a paragraph. Any leading greater-than and pipe symbols, as well as spaces, are considered part of a quoting prefix. If the next symbol is a plus, minus, or asterisk symbol, it, as well as any following spaces, is considered an item prefix. All those symbols so far form the prefix for the first line. The prefix for following lines is the same prefix with the item symbol replaced by a space. Additionally, if the prefix consists only of spaces, we consider the paragraph a quotation and indent the same number of symbols as the prefix from the right.

For each following line, we check if the maximal prefix is the expected one for the current paragraph. If so, we assume it belongs to the same paragraph, otherwise it starts a new paragraph. Empty lines are treated specially and always form a paragraph of their own; the number of empty lines is always preserved.

The Code

Here is the code I eventually came up with. Till now, I'm happy with it. And I always could line break the whole text, which wasn't the case before.

-- Implementaiton of line wrapping to obtain justification,
-- optimising over the whole paragraph.

import Data.List (stripPrefix)

-- * General setup

-- | When evaluating how good a text is wrapped there are two kind
-- of lines, those that form the block text, and the last line for
-- which it is OK to be shorter.

data LineType = BlockLine | LastLine

-- * Line forming

-- |
-- Line forming is a function that takes a list of words and a line
-- type and returns a line, together with the given cost. The reason
-- why we abstract over the line forming is to allow, in principle,
-- kerning to better fill the line

type LineFormer c = LineType -> [String] -> (c, String)

-- * Line wrapping

-- |
-- We assume a cost measure that is an ordered monoid, i.e.,
-- instanced Ord and Monoid, and the monoid operation is monotone
-- in the sense that a <= a' -> b <= b' -> a <> b <= a' <> b'.
--
-- With this premisse, a partially wrapped text is given by
--
-- * the part of the text that is wrapped into block lines already,
--
-- * the cost for wrapping of those parts, and
--
-- * the list of the words in the last line.

data PartialWrapping c = PartialWrapping [String] c [String] deriving (Eq, Show)

instance Ord c => Ord (PartialWrapping c) where
    (PartialWrapping ls c ws) <= (PartialWrapping ls' c' ws') =
        (c, length ls, length ws, ls, ws)
        <= (c', length ls', length ws', ls', ws')

emptyWrapping :: Monoid c => PartialWrapping c
emptyWrapping = PartialWrapping [] mempty []

addWord :: String -> PartialWrapping c -> PartialWrapping c
addWord w (PartialWrapping ls c ws) = PartialWrapping ls c (ws ++ [w])

newLine :: Monoid c => LineFormer c -> PartialWrapping c -> PartialWrapping c
newLine mkLine (PartialWrapping ls c ws) =
    let (c', newline) = mkLine BlockLine ws
    in PartialWrapping (ls ++ [newline]) (c `mappend` c') []

eoi :: Monoid c => LineFormer c -> PartialWrapping c -> (c, [String])
eoi mkLine (PartialWrapping ls c ws) =
    let (c', lastline) = mkLine LastLine ws
    in (c `mappend` c', ls ++ [lastline])

-- |
-- For a given list of words, the length of the last line in a partial wrapping
-- also determines the words in it. For the words before the last line, we only
-- care about the best wrapping, so for every length there is precisely one
-- wrapping to consider. As we only care about non-empty last lines (otherwise,
-- the line before would be the actual last line and should not be treated
-- as block line), the first entry (at position 0) are the wrappings with one
-- word in the last line, etc.
type WrapOptions c = [PartialWrapping c]

initialOptions :: Monoid c => String -> WrapOptions c
initialOptions w = [addWord w emptyWrapping]

extendOptions :: (Monoid c, Ord c)
              => LineFormer c
              -> WrapOptions c
              -> String
              -> WrapOptions c
extendOptions mkLine options w =
    (addWord w . minimum $ map (newLine mkLine) options)
    : (map (addWord w) options)

wrap :: (Monoid c, Ord c) => LineFormer c -> [String] -> [String]
wrap mkLine [] = []
wrap mkLine (w:ws) =
    snd . minimum . map (eoi mkLine) $
    foldl (extendOptions mkLine) (initialOptions w) ws


-- * Quadratic cost function with forbiden breaks

data ForbiddingCost = Infinity | Penalty Int deriving (Eq, Show)

instance Ord ForbiddingCost where
  _         <= Infinity   = True
  Infinity  <= Penalty _  = False
  Penalty x <= Penalty y  = x <= y

instance Monoid ForbiddingCost where
    mempty = Penalty 0
    Infinity `mappend` _          = Infinity
    _ `mappend` Infinity          = Infinity
    Penalty x `mappend` Penalty y = Penalty (x + y)

linePenalty :: Int      -- ^ hard limit on the line length
            -> Int      -- ^ target line length
            -> LineType
            -> Bool     -- ^ True, if the line is splitable
            -> String   -- ^ the line, as string
            -> ForbiddingCost

linePenalty hard  _        _         True s | length s > hard    = Infinity
linePenalty _     target   LastLine  _    s | length s <= target = Penalty 0
linePenalty _     target   _         _    s                      =
    let delta = (target - (length s)) in Penalty $ delta * delta

-- * Paragraph context

-- | Text usually appears in some context: it could be a block quote, part
-- of an enumeration, quoted, etc.
data ParagraphContext = ParagraphContext { pcFirstLineStart :: String
                                         , pcLineStart :: String
                                         , pcIndentRight :: Int
                                         } deriving Show

emptyParagraphContext :: ParagraphContext
emptyParagraphContext = ParagraphContext { pcFirstLineStart = ""
                                         , pcLineStart = ""
                                         , pcIndentRight = 0
                                         }

-- * Flat line forming

stdLineLimit :: Int
stdLineLimit = 80 -- Standard length of type writer, punch card, etc.
                  -- So we should not exceed this.

stdLineLength :: Int
stdLineLength = 65 -- Our penalty goes symmetric towards the target length,
                   -- so our lines are slightly longer than the target. Hence
                   -- aim for shorter paragraphs so that most lines do not
                   -- exceed the recommended 70 chars for email.

isSplitable :: [String] -> Bool
isSplitable (w:w':ws) = True
isSplitable _         = False

stdLine :: ParagraphContext
        -> LineFormer ForbiddingCost
stdLine pc lineType ws =
    let leftIndent = length $ pcLineStart pc
        hardLimit = stdLineLimit - leftIndent
        target = stdLineLength - leftIndent - (pcIndentRight pc)
        line = unwords ws
        splitable = isSplitable ws
    in (linePenalty hardLimit target lineType splitable line, line)

formatParagraph :: ParagraphContext -> [String] -> String
formatParagraph pc [] = (pcFirstLineStart pc) ++ "\n" -- preserve empty lines
formatParagraph pc ws =
    let wrappedLines = wrap (stdLine pc) ws
    in case wrappedLines of
         [] -> ""
         (l:ls) -> (pcFirstLineStart pc) ++ l ++ "\n"
                   ++ (unlines $ map ((++) (pcLineStart pc)) ls)

-- * Recognizing the paragraph context

prefixChars :: [Char]
prefixChars = ">| "

itemizeChars :: [Char]
itemizeChars = "-+*"

-- | Given a line that starts a new paragraph, determine the context, and the
-- payload words of that line.
takeContext :: String -> (ParagraphContext, [String])
takeContext s = let (prefix, s') = span (`elem` prefixChars) s
                    (prefix', s'') = case s' of
                                       c:cs | c `elem` itemizeChars ->
                                             let (p,cs') = span (== ' ') cs
                                             in (prefix ++ [c] ++ p, cs')
                                       _ -> (prefix, s')
                    rIndent = if all (== ' ') prefix' then length prefix' else 0
                    lStart = map (\c ->
                                      if c `elem` itemizeChars then ' ' else c)
                             prefix'
                in (emptyParagraphContext { pcFirstLineStart = prefix'
                                          , pcLineStart = lStart
                                          , pcIndentRight = rIndent
                                          }
                   , words s''
                   )

validLineStart :: String -> Bool
validLineStart (c:cs) | not (c `elem` prefixChars)
                        && not (c `elem` itemizeChars) = True
validLineStart _ = False

-- | Given a context for the first  paragraph, and a text, as list of lines,
-- with the first line of the first paragraph already removed, split it into
-- the remaining words of the first paragraph and the rest of the text.
readRestOfParagraph :: ParagraphContext -> [String] -> ([String], [String])
readRestOfParagraph pc [] = ([], [])
readRestOfParagraph pc (l:ls) =
    maybe ([], l:ls)
          (\ l' -> if validLineStart l'
                   then let (ws, ls') = readRestOfParagraph pc ls
                        in (words l' ++ ws, ls')
                   else ([], l:ls))
    $ stripPrefix (pcLineStart pc) l

-- | Given the lines of a text, split it into paragraphs. Here, each paragraph
-- is given by its context, and the lines it contains.
paragraphs :: [String] -> [(ParagraphContext, [String])]
paragraphs [] = []
paragraphs ("":ls) = (emptyParagraphContext, []) : paragraphs ls
  -- empty lines are always a pragraph of their own
paragraphs (l:ls) = let (pc, ws) = takeContext l
                        (ws', ls') = readRestOfParagraph pc ls
                    in (pc, ws ++ ws') : paragraphs ls'

-- * Putting things together

-- | Given a (lazy) text, produce its reformated version
format :: String -> String
format = foldr (++) ""  . map (uncurry formatParagraph) . paragraphs . lines

main :: IO ()
main = interact format

download

Cross-referenced by:

2020/12/12 More considerations on line breaking