2015/11/09: Strings and Strictness Patterns

In Haskell, standard strings are lists of characters and, as such, take a lot of memory. However, laziness usually hides this quite well: at any one moment, most parts of the string are either not computed yet, or not needed any more (and hence garbage collected).

Consider the following trivial example: we write out a 2 MiB string after a trivial local computation on it. As expected, the memory profile is flat, since we essentially only have to remember how many more characters remain to be processed.


import Data.Char (toUpper)

main :: IO ()
main = do
  let longstring = take (2 * 1024 * 1024) $ repeat 'a'
      s' = map toUpper longstring
  putStrLn s'

Things change radically if we inspect the string first: then we have to remember every character of the string, as we need it later in order to print it. The memory usage increases significantly.
import Data.Char (toUpper)

main :: IO ()
main = do
  let longstring = take (2 * 1024 * 1024) $ repeat 'a'
      s' = map toUpper longstring
  putStrLn . show $ any ((==) 'b') s'
  putStrLn s'

With this in mind, let's look at the strictness behaviour of Text.JSON, a JSON library that uses strings as its serialisation format. As expected, serialising 100000 floating point numbers to JSON (which produces about 1.8 MiB of output) has a flat memory profile, even if we force all those numbers to be present in memory first.


import qualified Text.JSON as J

-- | Generate a list of numbers, roughly of the given length.
numbers :: Double -> [Double]
numbers n | n > 0.0 = sqrt n : numbers (n - 1.0)
          | otherwise = []

main :: IO ()
main = do
  let vals = numbers 100000
  putStrLn . show $ any (< 0) vals
  putStrLn $ J.encode vals
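That flat profile depends on J.encode producing its output lazily: each number is rendered and written out before the next one is demanded. The streaming shape can be sketched with plain show and intercalate (a simplified stand-in, not Text.JSON's actual renderer):

```haskell
import Data.List (intercalate)

-- A naive, lazy list-to-JSON renderer: intercalate builds the output
-- string incrementally, so each element can be rendered, written out,
-- and garbage collected before the next one is even demanded.
renderList :: Show a => [a] -> String
renderList xs = "[" ++ intercalate "," (map show xs) ++ "]"

main :: IO ()
main = putStrLn (renderList [1.5, 2.25, 3.0 :: Double])
-- prints [1.5,2.25,3.0]
```

Feeding this renderer a long lazily generated list keeps the memory profile flat for the same reason the J.encode example does: the consumer pulls the output one chunk at a time.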

Before we start thinking about deserialisation, we first need a way to hold the input in memory efficiently. Byte strings do this, and allow us to generate a lazy string from the input. Therefore, to just pass strings through, we basically need only the 1.8 MiB of memory required to store the input.
import qualified Data.ByteString as BS
import qualified Data.ByteString.UTF8 as UTF8

main :: IO ()
main = do
  bs <- BS.getContents
  let s = UTF8.toString bs
  putStrLn s

Nevertheless, parsing that string inspects all of its characters before any relevant part of them goes out of scope; hence the big memory spike.
import qualified Data.ByteString as BS
import qualified Data.ByteString.UTF8 as UTF8
import qualified Text.JSON as J

main :: IO ()
main = do
  bs <- BS.getContents
  let s = UTF8.toString bs
  let vals = case (J.decode s :: J.Result [Double]) of
                    J.Ok xs -> xs
                    _ -> [0.0]
  putStrLn . show $ any (< 0.0) vals

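The spike is inherent in the interface: J.decode can only return J.Ok once the whole input has been parsed, so every character inspected so far and every value parsed so far are live at the same time. The same all-or-nothing shape can be sketched with a toy parser (hypothetical, base-only; not Text.JSON's implementation):

```haskell
import Data.Char (isDigit)

-- A toy all-or-nothing parser for input like "[1,2,3]": it can only
-- succeed once it has seen the closing ']', so nothing parsed so far
-- can be released while the rest of the input is still being inspected.
parseInts :: String -> Maybe [Int]
parseInts ('[' : rest) = go rest
  where
    go s = case span isDigit s of
      (ds@(_ : _), ',' : more) -> (read ds :) <$> go more
      (ds@(_ : _), "]")        -> Just [read ds]
      ("",         "]")        -> Just []
      _                        -> Nothing
parseInts _ = Nothing

main :: IO ()
main = print (parseInts "[1,2,3]")  -- prints Just [1,2,3]
```

Any parser with this result type holds on to its entire input and output at once; streaming would require an incremental interface instead of a single Result value.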

And note that the memory shown in the diagram is only the tip of the iceberg: due to the slow sampling rate, the actual peak usage is considerably higher, as running with a restricted heap shows.
$ cat data | ./json +RTS -hy -M70M
json: Heap exhausted;
Current maximum heap size is 73400320 bytes (70 MB);
use `+RTS -M<size>' to increase it.
$