2017/02/19: Time stamps with `pdflatex`

It is amazing how many direct and indirect time stamps occur in the output pdf file when running pdflatex. For starters, there are the creation and modification dates. Moreover, the optional ID field is also set, and to a value that itself depends on the dates already mentioned. So, to get reproducible output from a pdflatex invocation, the dates need to be set in the invocation of pdflatex (by specifying something like pdflatex '\pdfinfo{/CreationDate(D:19700101000000Z)/ModDate(D:19700101000000Z)}\input{...}') and the ID has to be removed or replaced by something depending only on the sources.
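
To see where these fields sit in a given pdf, a small helper along the following lines may be useful. It is only a sketch: the function name pdf_time_fields is made up for illustration, and it assumes that the fields are stored uncompressed, one per line, and that grep supports -a (as the script further down also assumes).

```shell
#!/bin/sh
# List the lines of a pdf that carry the time-dependent fields:
# /CreationDate, /ModDate and the /ID entry.
# Assumption: the fields are stored uncompressed, one per line.
# -a makes grep treat the binary pdf as text; -n prefixes line numbers.
pdf_time_fields() {
    LC_ALL=C grep -a -n -e '/CreationDate' -e '/ModDate' -e '^/ID ' "$1"
}
```

Calling pdf_time_fields on the output of two otherwise identical runs shows which lines differ between them.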

To make things worse, pdflatex converts eps files included with \includegraphics to pdf on the fly (which is nice) and, in doing so, adds those time-dependent components to each of the generated files. These files end up as parts of the final pdf, of course with all their time stamps. So, to obtain a reproducible pdflatex rule, those generated files have to be scrubbed as well.
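
As a rough illustration of what scrubbing one such file can look like, here is a minimal sed-based sketch. It is not the ed script used in the rule below; the function name scrub_pdf_dates is hypothetical, and it assumes the dates are stored uncompressed in the form D:YYYYMMDDhhmmss+hh'mm'. Each replacement has the same length as the text it replaces, so byte offsets elsewhere in the file are not shifted.

```shell
#!/bin/sh
# Replace the date in /CreationDate and /ModDate entries of one pdf with
# the Unix epoch, keeping the length of each entry unchanged.
# Assumption: dates look like D:YYYYMMDDhhmmss+hh'mm' and are uncompressed.
scrub_pdf_dates() {
    # sed -i is not portable, so write to a temporary file and move it back.
    LC_ALL=C sed \
        -e "/\/CreationDate/s/D:[0-9]\{14\}[-+][0-9][0-9]'[0-9][0-9]'/D:19700101000000+00'00'/" \
        -e "/\/ModDate/s/D:[0-9]\{14\}[-+][0-9][0-9]'[0-9][0-9]'/D:19700101000000+00'00'/" \
        "$1" > "$1.scrubbed" && mv "$1.scrubbed" "$1"
}
```

Lines that mention neither /CreationDate nor /ModDate are left untouched, even if they happen to contain something that looks like a date.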

All in all, I ended up with the following latex rule for bazel, and that is only for the simple case of a plain invocation without bibtex. The bzl file is a simple wrapper around a genrule.

def latex(name="", main="", srcs=[]):
    runlatex = str(Label("//aehlig_rules/latex:runpdflatex.sh"))
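    # Copy srcs so that neither the caller's list nor the default value is
    # modified when main is appended.
    allsrcs = list(srcs)
    if main not in allsrcs:
        allsrcs.append(main)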
    native.genrule(
        name = name + "_pdf",
        srcs = allsrcs,
        cmd = "sh $(location " + runlatex +") $@ $(location " + main + ") $(SRCS)",
        outs = [name + ".pdf"],
        tools = [runlatex],
    )


And the mentioned runpdflatex.sh file looks as follows.

#!/bin/sh

set -eu

ROOT=`pwd`

OUT="${ROOT}/$1"
shift
ENTRY="$1"
shift
FILES="$*"

TMP_DIR=${TMPDIR:-/tmp}
WRKDIR="$(mktemp -d "${TMP_DIR%%/}/bazel.XXXXXXXX")"
trap "rm -fr \"${WRKDIR}\"" EXIT

SRCDIR="${WRKDIR}/src"

LOGFILE="${WRKDIR}/log"
touch "${LOGFILE}"

echo '=== copying files ===' >> "${LOGFILE}"

SCRUB_PDF="${WRKDIR}/scrub_pdf.ed"

cat > "${SCRUB_PDF}" <<EOF
H
1
/xmp:CreateDate
s/2[0-9][0-9][0-9]-[0-2][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9].[0-9][0-9]:[0-5][0-9]/1970-01-01T00:00:00+00:00/
/CreationDate
s/2[0-9][0-9][0-9][0-1][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-5][0-9].[0-9][0-9]/19700101000000+00/
/ModDate
s/2[0-9][0-9][0-9][0-2][0-9][0-3][0-9][0-2][0-9][0-5][0-9][0-5][0-9]/19700101000000/
/\/ID
s/<[^>]*>/<00000000000000000000000000000000>/g
w
q
EOF

DO_SCRUB_PDF="${WRKDIR}/scrub_pdf.sh"
cat > "${DO_SCRUB_PDF}" <<EOF
#!/bin/sh
echo "Scrubbing \$1"
ed "\$1" < "${SCRUB_PDF}"
EOF
chmod 755 "${DO_SCRUB_PDF}"

for file in ${FILES}
do
    DIR=$(dirname "$(echo "${file}" | sed 's|^bazel-out/[^/]*/genfiles||')")
    mkdir -p "${SRCDIR}/${DIR}"
    cp "${file}" "${SRCDIR}/${DIR}"
done

cd "${SRCDIR}"

echo '=== first latex run ===' >> "${LOGFILE}"

pdflatex "${ENTRY}" >> "${LOGFILE}" 2>&1 || (cat "${LOGFILE}"; exit 1)

echo '=== second latex run ===' >> "${LOGFILE}"

pdflatex "${ENTRY}" >> "${LOGFILE}" 2>&1 || (cat "${LOGFILE}"; exit 1)

echo '=== scrubbing generated pdfs ===' >> "${LOGFILE}"

find . -name '*converted-to.pdf' -exec "${DO_SCRUB_PDF}" {} \; >> "${LOGFILE}" 2>&1

echo '=== final latex run ===' >> "${LOGFILE}"

pdflatex '\pdfinfo{/CreationDate(D:19700101000000Z)/ModDate(D:19700101000000Z)}\input{'"${ENTRY}"'}' >> "${LOGFILE}" 2>&1 \
    || (cat "${LOGFILE}"; exit 1)

OUTBASE=$(basename "${ENTRY}" .tex)
grep -av '^/ID \[\(<[0-9A-F]\{32\}>\) \1]$' "${SRCDIR}/${OUTBASE}.pdf" \
     > "${SRCDIR}/${OUTBASE}.pdf.without_pdf_id"

cp "${SRCDIR}/${OUTBASE}.pdf.without_pdf_id" "${OUT}"
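
To check that a build step like this really is reproducible, one option is to run it twice and compare the results byte by byte. The following is a minimal sketch under stated assumptions: check_reproducible is a hypothetical helper, not part of the rule above, and it expects a command that writes its output to the file named by its last argument.

```shell
#!/bin/sh
# Run a command twice and compare the two outputs byte by byte.
# Usage: check_reproducible CMD [ARGS...]
# CMD is called with one extra trailing argument: the file it should write.
# Returns 0 if both outputs are identical, 1 if they differ, and 2 if the
# command itself (or creating the temporary directory) fails.
check_reproducible() {
    repro_tmp=$(mktemp -d "${TMPDIR:-/tmp}/repro.XXXXXXXX") || return 2
    if "$@" "${repro_tmp}/first" && "$@" "${repro_tmp}/second"; then
        if cmp -s "${repro_tmp}/first" "${repro_tmp}/second"; then
            repro_status=0
        else
            repro_status=1
        fi
    else
        repro_status=2
    fi
    rm -rf "${repro_tmp}"
    return "${repro_status}"
}
```

With a wrapper function that produces the pdf at the path it is given, a return value of 1 indicates that some time-dependent component is still left in the output.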


Cross-referenced by:

2021/04/03 Bazel Latex Rules
