2020/12/12: More considerations on line breaking

Since slightly more than two years I'm now using my new line-breaking method. Generally, I'm happy with it, but over time, I added a couple of minor modifications to ensure that blindly doing :%!linebreak works in almost all cases.

The first observation is that there are more item-like entities I use in formatted plain text (like emails). Those new item markers are a sequence of digits and letters (but nothing else, in particular no spaces) in parentheses (think [1], (ii), (III), etc). This also allows nicely-looking and easy-to-find footnotes. Code-wise, this just means that if a line starts with such a construct, we take it as a context (with the marker replaced by spaces as prefix for the follow-up lines).

Next, I wanted to ensure that linebreak behaves idempotently. I often write an email, reformat it, and start proof reading it. Sure enough, I have to make some changes and while doing so significantly change line lengths. So I reformat again. In this second reformatting, I don't want new structural elements come out of nowhere. The typical case that used to happen was a new item occurring when an n-dash (which I usually write in TeX-style as --) got moved to the beginning of a line; German uses n-dashes with spacing where English uses m-dashes without spacing. The easiest way to avoid these unwanted new structural elements is to avoid that certain symbols are moved to the beginning of a line. I do this by essentially adding non-breakable spaces, i.e., by gluing "words" starting with one of those dangerous symbols to the word before them.

A minor tweak is to avoid too short last lines of a paragraph, which, in my opinion look weird. So, let's add a small penalty if the last line does not fill a certain fraction of the target line length.

Finally, since others got interested in this program, I added a BSD-style license. Enjoy!

-- Implementaiton of line wrapping to obtain justification,
-- optimising over the whole paragraph.

{- Copyright (C) 2018--2020 Klaus Aehlig. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-}

import Data.List (stripPrefix)
import Data.Maybe (isJust)

-- * General setup

-- | When evaluating how good a text is wrapped there are two kind
-- of lines, those that form the block text, and the last line for
-- which it is OK to be shorter.

data LineType = BlockLine | LastLine

-- * Line forming

-- |
-- Line forming is a function that takes a list of words and a line
-- type and returns a line, together with the given cost. The reason
-- why we abstract over the line forming is to allow, in principle,
-- kerning to better fill the line

type LineFormer c = LineType -> [String] -> (c, String)

-- * Line wrapping

-- |
-- We assume a cost measure that is an ordered monoid, i.e.,
-- instanced Ord and Monoid, and the monoid operation is monotone
-- in the sense that a <= a' -> b <= b' -> a <> b <= a' <> b'.
--
-- With this premisse, a partially wrapped text is given by
--
-- * the part of the text that is wrapped into block lines already,
--
-- * the cost for wrapping of those parts, and
--
-- * the list of the words in the last line.

data PartialWrapping c = PartialWrapping [String] c [String] deriving (Eq, Show)

instance Ord c => Ord (PartialWrapping c) where
    (PartialWrapping ls c ws) <= (PartialWrapping ls' c' ws') =
        (c, length ls, length ws, ls, ws)
        <= (c', length ls', length ws', ls', ws')

emptyWrapping :: Monoid c => PartialWrapping c
emptyWrapping = PartialWrapping [] mempty []

addWord :: String -> PartialWrapping c -> PartialWrapping c
addWord w (PartialWrapping ls c ws) = PartialWrapping ls c (ws ++ [w])

newLine :: Monoid c => LineFormer c -> PartialWrapping c -> PartialWrapping c
newLine mkLine (PartialWrapping ls c ws) =
    let (c', newline) = mkLine BlockLine ws
    in PartialWrapping (ls ++ [newline]) (c `mappend` c') []

eoi :: Monoid c => LineFormer c -> PartialWrapping c -> (c, [String])
eoi mkLine (PartialWrapping ls c ws) =
    let (c', lastline) = mkLine LastLine ws
    in (c `mappend` c', ls ++ [lastline])

-- |
-- For a given list of words, the length of the last line in a partial wrapping
-- also determines the words in it. For the words before the last line, we only
-- care about the best wrapping, so for every length there is precisely one
-- wrapping to consider. As we only care about non-empty last lines (otherwise,
-- the line before would be the actual last line and should not be treated
-- as block line), the first entry (at position 0) are the wrappings with one
-- word in the last line, etc.
type WrapOptions c = [PartialWrapping c]

initialOptions :: Monoid c => String -> WrapOptions c
initialOptions w = [addWord w emptyWrapping]

extendOptions :: (Monoid c, Ord c)
              => LineFormer c
              -> WrapOptions c
              -> String
              -> WrapOptions c
extendOptions mkLine options w =
    (addWord w . minimum $ map (newLine mkLine) options)
    : (map (addWord w) options)

wrap :: (Monoid c, Ord c) => LineFormer c -> [String] -> [String]
wrap mkLine [] = []
wrap mkLine (w:ws) =
    snd . minimum . map (eoi mkLine) $
    foldl (extendOptions mkLine) (initialOptions w) ws


-- * Quadratic cost function with forbiden breaks

data ForbiddingCost = Infinity | Penalty Int deriving (Eq, Show)

instance Ord ForbiddingCost where
  _         <= Infinity   = True
  Infinity  <= Penalty _  = False
  Penalty x <= Penalty y  = x <= y

instance Semigroup ForbiddingCost where
    Infinity <> _ = Infinity
    _ <> Infinity = Infinity
    Penalty x <> Penalty y = Penalty $ x + y

instance Monoid ForbiddingCost where
    mempty = Penalty 0

linePenalty :: Int      -- ^ hard limit on the line length
            -> Int      -- ^ target line length
            -> LineType
            -> Bool     -- ^ True, if the line is splitable
            -> String   -- ^ the line, as string
            -> ForbiddingCost

linePenalty hard  _        _         True s | length s > hard    = Infinity
linePenalty _     target   LastLine  _    s
    | length s <= (target `div` lastLineMinFraction) =
    let delta = ((target `div` lastLineMinFraction) - (length s))
    in Penalty $ 3 * delta * delta
linePenalty _     target   LastLine  _    s | length s <= target = Penalty 0
linePenalty _     target   _         _    s                      =
    let delta = (target - (length s)) in Penalty $ delta * delta

-- * Paragraph context

-- | Text usually appears in some context: it could be a block quote, part
-- of an enumeration, quoted, etc.
data ParagraphContext = ParagraphContext { pcFirstLineStart :: String
                                         , pcLineStart :: String
                                         , pcIndentRight :: Int
                                         } deriving Show

emptyParagraphContext :: ParagraphContext
emptyParagraphContext = ParagraphContext { pcFirstLineStart = ""
                                         , pcLineStart = ""
                                         , pcIndentRight = 0
                                         }

-- * Flat line forming

stdLineLimit :: Int
stdLineLimit = 80 -- Standard length of type writer, punch card, etc.
                  -- So we should not exceed this.

stdLineLength :: Int
stdLineLength = 65 -- Our penalty goes symmetric towards the target length,
                   -- so our lines are slightly longer than the target. Hence
                   -- aim for shorter paragraphs so that most lines do not
                   -- exceed the recommended 70 chars for email.

lastLineMinFraction :: Int
lastLineMinFraction = 5 -- If the last line is shorter than 1/5 of the target
                        -- length, then take the offset to that minimal fraction
                        -- as a penalty as well.

isSplitable :: [String] -> Bool
isSplitable (w:w':ws) = True
isSplitable _         = False

stdLine' :: Int
         -> Int
         -> ParagraphContext
         -> LineFormer ForbiddingCost
stdLine' lLimit lLength pc lineType ws =
    let leftIndent = length $ pcLineStart pc
        hardLimit = lLimit - leftIndent
        target = lLength - leftIndent - (pcIndentRight pc)
        line = unwords ws
        splitable = isSplitable ws
    in (linePenalty hardLimit target lineType splitable line, line)

stdLine = stdLine' stdLineLimit stdLineLength

formatParagraph :: ParagraphContext -> [String] -> String
formatParagraph pc [] = (pcFirstLineStart pc) ++ "\n" -- preserve empty lines
formatParagraph pc ws =
    let wrappedLines = wrap (stdLine pc) ws
    in case wrappedLines of
         [] -> ""
         (l:ls) -> (pcFirstLineStart pc) ++ l ++ "\n"
                   ++ (unlines $ map ((++) (pcLineStart pc)) ls)

-- * Recognizing the paragraph context

prefixChars :: [Char]
prefixChars = ">| "

itemizeChars :: [Char]
itemizeChars = "-+*"

-- | Given a line, split off the footnote identifier, if any
footnoteChars = ['a' .. 'z'] ++ ['A' .. 'Z'] ++ ['0' .. '9']

scanFootNoteAux :: Char -> Char -> String -> Maybe (String, String)
scanFootNoteAux openPar closePar s =
    case span (`elem` footnoteChars) s of
        (ident@(x:_), c:r) | c == closePar ->
          let (p', s') = span (== ' ') r
          in Just (openPar:ident ++ [closePar] ++ p', s')
        _ -> Nothing

splitFootnote :: String -> Maybe (String, String)
splitFootnote ('[':s) = scanFootNoteAux '[' ']' s
splitFootnote ('(':s) = scanFootNoteAux '(' ')' s
splitFootnote _       = Nothing

isFootnoteLine :: String -> Bool
isFootnoteLine s = let (_, s') = span (`elem` prefixChars) s
                   in isJust $ splitFootnote s'

-- | Given a line that starts a new paragraph, determine the context, and the
-- payload words of that line.
takeContext :: String -> (ParagraphContext, [String])
takeContext s =
  let (prefix, s') = span (`elem` prefixChars) s
  in case splitFootnote s' of
    Just (fn, s'') -> ( emptyParagraphContext { pcFirstLineStart = prefix ++ fn
                                              , pcLineStart =
                                                  prefix ++ (map (const ' ') fn)
                                              }
                      , words s''
                      )
    Nothing -> let (prefix', s'') = case s' of
                                       c:c':cs | (c `elem` itemizeChars)
                                                 && (c' == ' ') ->
                                                     let (p,cs') = span
                                                                   (== ' ')
                                                                   (c':cs)
                                                     in (prefix ++ [c] ++ p,
                                                                cs')
                                       _ -> (prefix, s')
                   rIndent = if all (== ' ') prefix' then length prefix' else 0
                   lStart = map (\c ->
                                     if c `elem` itemizeChars then ' ' else c)
                            prefix'
                in (emptyParagraphContext { pcFirstLineStart = prefix'
                                          , pcLineStart = lStart
                                          , pcIndentRight = rIndent
                                          }
                   , words s''
                   )

validLineStart :: String -> Bool
validLineStart cs | all (== ' ') cs = False
validLineStart s | isFootnoteLine s = False
validLineStart [c] | not (c `elem` prefixChars) = True
validLineStart (c:c':cs) | not (c `elem` prefixChars)
                           && not ((c `elem` itemizeChars) && c' == ' ') = True
validLineStart _ = False

-- | Given a context for the first  paragraph, and a text, as list of lines,
-- with the first line of the first paragraph already removed, split it into
-- the remaining words of the first paragraph and the rest of the text.
readRestOfParagraph :: ParagraphContext -> [String] -> ([String], [String])
readRestOfParagraph pc [] = ([], [])
readRestOfParagraph pc (l:ls) =
    maybe ([], l:ls)
          (\ l' -> if validLineStart l'
                   then let (ws, ls') = readRestOfParagraph pc ls
                        in (words l' ++ ws, ls')
                   else ([], l:ls))
    $ stripPrefix (pcLineStart pc) l

-- | Certain symbols should not be at the beginning of a line; therefore join
-- words starting with those to the previous one.
wordJoinerLetters :: [Char]
wordJoinerLetters =  prefixChars ++ itemizeChars ++ ".[("

reJoinWords :: [String] -> [String]
reJoinWords [] = []
reJoinWords [w] = [w]
reJoinWords (w:w'@(c:_):ws) | c `elem` wordJoinerLetters
     = reJoinWords $ (w ++ " " ++ w'):ws
reJoinWords (w:w':ws) = w : (reJoinWords $ w':ws)

-- | Given the lines of a text, split it into paragraphs. Here, each paragraph
-- is given by its context, and the lines it contains.
paragraphs :: [String] -> [(ParagraphContext, [String])]
paragraphs [] = []
paragraphs (l:ls) = let (pc, ws) = takeContext l
                    in if ws == []
                       -- empty lines are always a pragraph of their own
                       then (pc, ws) : paragraphs ls
                       else let (ws', ls') = readRestOfParagraph pc ls
                            in (pc, reJoinWords $ ws ++ ws') : paragraphs ls'

-- * Putting things together

-- | Given a (lazy) text, produce its reformated version
format :: String -> String
format = foldr (++) ""  . map (uncurry formatParagraph) . paragraphs . lines

main :: IO ()
main = interact format

download