Since slightly more than two years I'm now using my new line-breaking method. Generally, I'm happy with it, but over time, I added a couple of minor modifications to ensure that blindly doing :%!linebreak works in almost all cases.
The first observation is that there are more item-like entities I use in formatted plain text (like emails). Those new item markers are a sequence of digits and letters (but nothing else, in particular no spaces) in parentheses (think [1], (ii), (III), etc). This also allows nicely-looking and easy-to-find footnotes. Code-wise, this just means that if a line starts with such a construct, we take it as a context (with the marker replaced by spaces as prefix for the follow-up lines).
Next, I wanted to ensure that linebreak behaves idempotently. I often write an email, reformat it, and start proof reading it. Sure enough, I have to make some changes and while doing so significantly change line lengths. So I reformat again. In this second reformatting, I don't want new structural elements come out of nowhere. The typical case that used to happen was a new item occurring when an n-dash (which I usually write in TeX-style as --) got moved to the beginning of a line; German uses n-dashes with spacing where English uses m-dashes without spacing. The easiest way to avoid these unwanted new structural elements is to avoid that certain symbols are moved to the beginning of a line. I do this by essentially adding non-breakable spaces, i.e., by gluing "words" starting with one of those dangerous symbols to the word before them.
A minor tweak is to avoid too short last lines of a paragraph, which, in my opinion look weird. So, let's add a small penalty if the last line does not fill a certain fraction of the target line length.
Finally, since others got interested in this program, I added a BSD-style license. Enjoy!
-- Implementaiton of line wrapping to obtain justification, -- optimising over the whole paragraph. {- Copyright (C) 2018--2020 Klaus Aehlig. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -} import Data.List (stripPrefix) import Data.Maybe (isJust) -- * General setup -- | When evaluating how good a text is wrapped there are two kind -- of lines, those that form the block text, and the last line for -- which it is OK to be shorter. data LineType = BlockLine | LastLine -- * Line forming -- | -- Line forming is a function that takes a list of words and a line -- type and returns a line, together with the given cost. The reason -- why we abstract over the line forming is to allow, in principle, -- kerning to better fill the line type LineFormer c = LineType -> [String] -> (c, String) -- * Line wrapping -- | -- We assume a cost measure that is an ordered monoid, i.e., -- instanced Ord and Monoid, and the monoid operation is monotone -- in the sense that a <= a' -> b <= b' -> a <> b <= a' <> b'. -- -- With this premisse, a partially wrapped text is given by -- -- * the part of the text that is wrapped into block lines already, -- -- * the cost for wrapping of those parts, and -- -- * the list of the words in the last line. data PartialWrapping c = PartialWrapping [String] c [String] deriving (Eq, Show) instance Ord c => Ord (PartialWrapping c) where (PartialWrapping ls c ws) <= (PartialWrapping ls' c' ws') = (c, length ls, length ws, ls, ws) <= (c', length ls', length ws', ls', ws') emptyWrapping :: Monoid c => PartialWrapping c emptyWrapping = PartialWrapping [] mempty [] addWord :: String -> PartialWrapping c -> PartialWrapping c addWord w (PartialWrapping ls c ws) = PartialWrapping ls c (ws ++ [w]) newLine :: Monoid c => LineFormer c -> PartialWrapping c -> PartialWrapping c newLine mkLine (PartialWrapping ls c ws) = let (c', newline) = mkLine BlockLine ws in PartialWrapping (ls ++ [newline]) (c `mappend` c') [] eoi :: Monoid c => LineFormer c -> PartialWrapping c -> (c, [String]) eoi mkLine (PartialWrapping ls c ws) = let (c', lastline) = mkLine LastLine ws in (c `mappend` c', ls ++ [lastline]) -- | -- For a given list of words, the length of the last line in a partial wrapping -- also determines the words in it. For the words before the last line, we only -- care about the best wrapping, so for every length there is precisely one -- wrapping to consider. As we only care about non-empty last lines (otherwise, -- the line before would be the actual last line and should not be treated -- as block line), the first entry (at position 0) are the wrappings with one -- word in the last line, etc. type WrapOptions c = [PartialWrapping c] initialOptions :: Monoid c => String -> WrapOptions c initialOptions w = [addWord w emptyWrapping] extendOptions :: (Monoid c, Ord c) => LineFormer c -> WrapOptions c -> String -> WrapOptions c extendOptions mkLine options w = (addWord w . minimum $ map (newLine mkLine) options) : (map (addWord w) options) wrap :: (Monoid c, Ord c) => LineFormer c -> [String] -> [String] wrap mkLine [] = [] wrap mkLine (w:ws) = snd . minimum . map (eoi mkLine) $ foldl (extendOptions mkLine) (initialOptions w) ws -- * Quadratic cost function with forbiden breaks data ForbiddingCost = Infinity | Penalty Int deriving (Eq, Show) instance Ord ForbiddingCost where _ <= Infinity = True Infinity <= Penalty _ = False Penalty x <= Penalty y = x <= y instance Semigroup ForbiddingCost where Infinity <> _ = Infinity _ <> Infinity = Infinity Penalty x <> Penalty y = Penalty $ x + y instance Monoid ForbiddingCost where mempty = Penalty 0 linePenalty :: Int -- ^ hard limit on the line length -> Int -- ^ target line length -> LineType -> Bool -- ^ True, if the line is splitable -> String -- ^ the line, as string -> ForbiddingCost linePenalty hard _ _ True s | length s > hard = Infinity linePenalty _ target LastLine _ s | length s <= (target `div` lastLineMinFraction) = let delta = ((target `div` lastLineMinFraction) - (length s)) in Penalty $ 3 * delta * delta linePenalty _ target LastLine _ s | length s <= target = Penalty 0 linePenalty _ target _ _ s = let delta = (target - (length s)) in Penalty $ delta * delta -- * Paragraph context -- | Text usually appears in some context: it could be a block quote, part -- of an enumeration, quoted, etc. data ParagraphContext = ParagraphContext { pcFirstLineStart :: String , pcLineStart :: String , pcIndentRight :: Int } deriving Show emptyParagraphContext :: ParagraphContext emptyParagraphContext = ParagraphContext { pcFirstLineStart = "" , pcLineStart = "" , pcIndentRight = 0 } -- * Flat line forming stdLineLimit :: Int stdLineLimit = 80 -- Standard length of type writer, punch card, etc. -- So we should not exceed this. stdLineLength :: Int stdLineLength = 65 -- Our penalty goes symmetric towards the target length, -- so our lines are slightly longer than the target. Hence -- aim for shorter paragraphs so that most lines do not -- exceed the recommended 70 chars for email. lastLineMinFraction :: Int lastLineMinFraction = 5 -- If the last line is shorter than 1/5 of the target -- length, then take the offset to that minimal fraction -- as a penalty as well. isSplitable :: [String] -> Bool isSplitable (w:w':ws) = True isSplitable _ = False stdLine' :: Int -> Int -> ParagraphContext -> LineFormer ForbiddingCost stdLine' lLimit lLength pc lineType ws = let leftIndent = length $ pcLineStart pc hardLimit = lLimit - leftIndent target = lLength - leftIndent - (pcIndentRight pc) line = unwords ws splitable = isSplitable ws in (linePenalty hardLimit target lineType splitable line, line) stdLine = stdLine' stdLineLimit stdLineLength formatParagraph :: ParagraphContext -> [String] -> String formatParagraph pc [] = (pcFirstLineStart pc) ++ "\n" -- preserve empty lines formatParagraph pc ws = let wrappedLines = wrap (stdLine pc) ws in case wrappedLines of [] -> "" (l:ls) -> (pcFirstLineStart pc) ++ l ++ "\n" ++ (unlines $ map ((++) (pcLineStart pc)) ls) -- * Recognizing the paragraph context prefixChars :: [Char] prefixChars = ">| " itemizeChars :: [Char] itemizeChars = "-+*" -- | Given a line, split off the footnote identifier, if any footnoteChars = ['a' .. 'z'] ++ ['A' .. 'Z'] ++ ['0' .. '9'] scanFootNoteAux :: Char -> Char -> String -> Maybe (String, String) scanFootNoteAux openPar closePar s = case span (`elem` footnoteChars) s of (ident@(x:_), c:r) | c == closePar -> let (p', s') = span (== ' ') r in Just (openPar:ident ++ [closePar] ++ p', s') _ -> Nothing splitFootnote :: String -> Maybe (String, String) splitFootnote ('[':s) = scanFootNoteAux '[' ']' s splitFootnote ('(':s) = scanFootNoteAux '(' ')' s splitFootnote _ = Nothing isFootnoteLine :: String -> Bool isFootnoteLine s = let (_, s') = span (`elem` prefixChars) s in isJust $ splitFootnote s' -- | Given a line that starts a new paragraph, determine the context, and the -- payload words of that line. takeContext :: String -> (ParagraphContext, [String]) takeContext s = let (prefix, s') = span (`elem` prefixChars) s in case splitFootnote s' of Just (fn, s'') -> ( emptyParagraphContext { pcFirstLineStart = prefix ++ fn , pcLineStart = prefix ++ (map (const ' ') fn) } , words s'' ) Nothing -> let (prefix', s'') = case s' of c:c':cs | (c `elem` itemizeChars) && (c' == ' ') -> let (p,cs') = span (== ' ') (c':cs) in (prefix ++ [c] ++ p, cs') _ -> (prefix, s') rIndent = if all (== ' ') prefix' then length prefix' else 0 lStart = map (\c -> if c `elem` itemizeChars then ' ' else c) prefix' in (emptyParagraphContext { pcFirstLineStart = prefix' , pcLineStart = lStart , pcIndentRight = rIndent } , words s'' ) validLineStart :: String -> Bool validLineStart cs | all (== ' ') cs = False validLineStart s | isFootnoteLine s = False validLineStart [c] | not (c `elem` prefixChars) = True validLineStart (c:c':cs) | not (c `elem` prefixChars) && not ((c `elem` itemizeChars) && c' == ' ') = True validLineStart _ = False -- | Given a context for the first paragraph, and a text, as list of lines, -- with the first line of the first paragraph already removed, split it into -- the remaining words of the first paragraph and the rest of the text. readRestOfParagraph :: ParagraphContext -> [String] -> ([String], [String]) readRestOfParagraph pc [] = ([], []) readRestOfParagraph pc (l:ls) = maybe ([], l:ls) (\ l' -> if validLineStart l' then let (ws, ls') = readRestOfParagraph pc ls in (words l' ++ ws, ls') else ([], l:ls)) $ stripPrefix (pcLineStart pc) l -- | Certain symbols should not be at the beginning of a line; therefore join -- words starting with those to the previous one. wordJoinerLetters :: [Char] wordJoinerLetters = prefixChars ++ itemizeChars ++ ".[(" reJoinWords :: [String] -> [String] reJoinWords [] = [] reJoinWords [w] = [w] reJoinWords (w:w'@(c:_):ws) | c `elem` wordJoinerLetters = reJoinWords $ (w ++ " " ++ w'):ws reJoinWords (w:w':ws) = w : (reJoinWords $ w':ws) -- | Given the lines of a text, split it into paragraphs. Here, each paragraph -- is given by its context, and the lines it contains. paragraphs :: [String] -> [(ParagraphContext, [String])] paragraphs [] = [] paragraphs (l:ls) = let (pc, ws) = takeContext l in if ws == [] -- empty lines are always a pragraph of their own then (pc, ws) : paragraphs ls else let (ws', ls') = readRestOfParagraph pc ls in (pc, reJoinWords $ ws ++ ws') : paragraphs ls' -- * Putting things together -- | Given a (lazy) text, produce its reformated version format :: String -> String format = foldr (++) "" . map (uncurry formatParagraph) . paragraphs . lines main :: IO () main = interact format