2018/11/18: Diffing webpages

There are a lot of web pages that only change occasionally, but when they do, I want to be informed about it. Examples include the home pages of some researchers working in the same field and the release pages of software I use. For this task I have been using, for many years, a simple script (that someone recently asked about, hence this blog entry) that keeps a copy of the web page as it was last fetched. Triggered usually by a cron job, the page gets refetched and diff(1)'ed against the stored version; if there is a diff, it gets mailed to a specified email address.
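
For illustration, a crontab(1) entry for this could look roughly like the following (the installation path of the script and the log file are made up; adjust them to your setup):

# run the watcher once a week; cron passes the command to /bin/sh
30 4 * * 1  $HOME/bin/wwwdiff >> $HOME/.wwwdiff/log 2>&1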

That worked quite well (and still does) for the cases where the page to watch and diff is a hand-written static web page (or generated in a clean, straightforward way from a hand-written description). It only changes when the contents change, and the diff of the html is meaningful. Large parts of the web have changed since that time. But fortunately, I chose the data structure to be flexible enough, by specifying that the descriptions of how to fetch the web page and of how to mail out the diff be arbitrary scripts to be executed by /bin/sh.

That turned out to be handy nowadays, where for a lot of webpages it is impossible to find out whether the contents changed (let alone to get a meaningful diff) without actually rendering the page (which often includes interpreting javascript), stripping out the comments section, etc. Instead of wget(1), the script now often (but not always; the simple-to-read html pages still do exist) calls edbrowse(1), whose scripting abilities allow it to navigate to the correct page, copy the relevant part of the rendered version to a new buffer, and do further canonisation there. In fact, about two years ago, I changed the way wwwdiff adds a new URL to be watched so that it checks whether the "URL" starts with a <-character (remember to quote when calling the script from the shell!); if this is the case, the added command will be an invocation of edbrowse(1) calling the specified function, followed by a write of the current buffer to the requested output file. The actual definition of the function is in the .ebrc file, as usual.
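
As an illustration of the edbrowse route (the function name, URL, and email address here are made up, and the exact .ebrc function syntax is best checked against the edbrowse documentation), adding such a page could look like

wwwdiff '<watchnews' someone@example.org

with a function along these lines in ~/.ebrc:

function+watchnews {
  b http://www.example.org/news.html
}

The generated get script then feeds edbrowse the line "<watchnews" followed by "w $1" and "qt", so the function only has to leave the relevant, rendered content in the current buffer.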

Enough introduction. Here are the files. Feel free to use them for any purpose, but there is no warranty whatsoever.

First there is the actual script.


#!/usr/bin/perl -w

# Simple script to get web pages and compare them with a stored version.
# Essentially, this is a glorified way to call wget(1) and diff(1).
#
# see wwwdiff(1) and wwwdiff-files(5) for a documentation

use strict;

my $home=$ENV{'HOME'};
defined $home && -d $home or $home = "";

my $wwwdiff_directory = $ENV{'WWWDIFF_DIRECTORY'}; 
defined($wwwdiff_directory) or $wwwdiff_directory = "$home/.wwwdiff";

-d $wwwdiff_directory or mkdir $wwwdiff_directory or die "Couldn't create mirror directory $wwwdiff_directory\n";

$wwwdiff_directory =~ /\/$/ and chop($wwwdiff_directory);

@ARGV == 2 and do {
    add_url($ARGV[0], $ARGV[1]);
    exit 0;
};

my $single_url = "";
my $initial_wait = $ENV{'WWWDIFF_INITIALWAIT'};
my $wait = $ENV{'WWWDIFF_WAIT'};

@ARGV == 1 and do {
    $single_url = quote($ARGV[0]);
    defined($initial_wait) or $initial_wait = 2;
    defined($wait) or $wait = 3;
};

@ARGV != 0 and @ARGV != 1 and die "Number of arguments has to be 0, 1 or 2 (URL and email).\n";

print `date`;
print "Updating mirror...\n";

defined($initial_wait) or $initial_wait = 3600;
$initial_wait = int(rand($initial_wait));

defined($wait) or $wait = 600;

print "Waiting for $initial_wait seconds to scatter start time.\n";
sleep($initial_wait);

opendir my $mdir, $wwwdiff_directory
    or die "Unable to open $wwwdiff_directory ($!)\n";

my @TASKS = readdir $mdir;

close $mdir;

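# Walk over the per-URL directories: fetch each page, diff it against the
# stored copy, notify if there is a difference, and rotate the stored files.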
foreach(@TASKS) {
    $_ eq "." and next;
    $_ eq ".." and next;
    $single_url eq "" or $_ eq $single_url or next;

    my $dir = $wwwdiff_directory . "/" . $_;

    my $dowait = int(rand($wait));

    print "Resting for $dowait seconds.\n";
    sleep($dowait);
    print "Updating $_\n";
    print `date`;
    get_lock("$dir/#lock") and next;

    system "/bin/sh", "$dir/get", "$dir/datanew"
        and do {
            print "Fetch failed ($?), giving up.\n";
            rmdir "$dir/#lock";
            next;
        };

    system "touch", "$dir/datanew";
    open(DIFF, "diff -u \Q$dir/data\E \Q$dir/datanew\E |")
        or do {
            print "couldn't fork diff ($!)\n";
            rmdir "$dir/#lock";
            next;
        };

    my $diff = do {local $/; <DIFF> };

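    # close() returns false when diff exits non-zero; that is expected when the
    # files differ, so treat it as an error only if errno ($!) is actually set.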
    close(DIFF) or (not $!) or do {
        print "close of diff failed ($!)\n";
        rmdir "$dir/#lock";
        next;
    };

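    # diff(1) exits with status 1 if the files differ (and 2 on trouble).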
    $? >> 8 == 1 and do {
        print "Found a difference:\n" . $diff . "\n";
        
        open(MAIL,"| /bin/sh \Q$dir/notify\E") 
            or do {
                print "couldn't open notification process ($!)\n";
                rmdir "$dir/#lock";
                next;
            };

        print MAIL $diff;

        close(MAIL) 
            or do {
                print "Notification process close failed ($!)\n";
                rmdir "$dir/#lock";
                next;
            };

    };

    system "mv", "$dir/data", "$dir/dataold";
    system "mv", "$dir/datanew", "$dir/data";

    rmdir "$dir/#lock";
}

print "Done: " . `date`;
exit 0;

######################################################################

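# Register a new URL to be watched: create its directory, write the get and
# notify scripts, and fetch an initial copy of the page.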
sub add_url
{
    my ($url,$email) = @_;

    print "Will watch $url informing $email about changes.\n";
    my $dir = $wwwdiff_directory . "/" . quote($url);

    -d $dir and die "Directory $dir exists already.\nIt seems that the URL is already mirrored;" .
        " rename the directory if this is not the case.\n";

    mkdir $dir or die "Failed to create directory $dir ($!)\n";
    
    open(GET, "> $dir/get")
        or die "Failed to open $dir/get ($!)\n";

    if ($url =~ /^</) {
      # watch an edbrowse macro
      print GET "edbrowse <<EOI\n$url\nw \$1\nqt\nEOI\n";
    } else {
      print GET "wget \Q$url\E -O \$1";
    }
    
    close(GET)
        or die "Failed to close $dir/get ($!)\n";

    open(NOTIFY, "> $dir/notify")
        or die "Failed to open $dir/notify ($!)\n";

    print NOTIFY "mailx -s \QUpdate of $url\E \Q$email\E";

    close(NOTIFY)
        or die "Failed to close $dir/notify ($!)\n";

    print "Administrative files written. Will fetch to get initial state\n";

    system "/bin/sh", "$dir/get", "$dir/datanew"
        and print "Warning: fetch failed! ($?)\n";

    system "cp", "$dir/datanew", "$dir/dataold"
        and print "Warning: cp failed. ($?)\n";

    system "mv", "$dir/datanew", "$dir/data"
        and print "Warning: mv failed. ($?)\n";

}

# Try to create a new directory $lock; if that fails, wait
# and try again a few more times.
sub get_lock
{
    my ($lock) = @_;

    my $retries = 4;

    for($retries--;$retries>=0;$retries--) {
        mkdir $lock and return 0;
        print "Failed to create lock $lock ($!)\nWill try another $retries times.\n";
        if ($retries) {
            my $waitlock = 10+int(rand(60));
            print "Waiting $waitlock seconds before retrying.\n";
            sleep $waitlock;
        }
    }
    print "Giving up on getting lock $lock.\n";
    return 1;
}

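# Turn a URL into a safe directory name: every character other than a letter,
# a digit, or '-' is replaced by an underscore, its decimal code, and another
# underscore.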
sub quote
{
    my ($name) = @_;

    $name =~ s/([^-a-zA-Z0-9])/"_" . ord ($1) . "_" /eg;
    return $name;
}
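
For illustration (URL and email address made up): after calling

wwwdiff http://www.example.org/news.html someone@example.org

the directory ~/.wwwdiff/http_58__47__47_www_46_example_46_org_47_news_46_html/
roughly looks like this:

get       # wget invocation writing the page to the file named in $1
notify    # pipes the diff on stdin into mailx -s "Update of ..." someone@example.org
data      # the page as fetched last time
dataold   # the previous version of the page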

Then there is the man page.
.TH WWWDIFF 1 "May 4, 2008" "" ""
.SH NAME
wwwdiff - monitor web pages for changes

.SH SYNOPSIS
\fBwwwdiff\fR
.br
\fBwwwdiff <URL>\fR
.br
\fBwwwdiff <URL> <email>\fR
.br

.SH DESCRIPTION
wwwdiff is a tool to keep local copies of selected web pages and
monitor them for changes. If called without any arguments, it
updates its entire mirror and sends notifications where
needed. The file structure is described in wwwdiff-files(5). If called
with a single argument, it updates the copy of that URL only; also, the
initial wait period is shortened to a few seconds.

In the last form, the new URL is entered into the database
with the given email as notification address. If the URL starts
with '<', i.e., a less-than symbol, the URL is assumed to be an
edbrowse(1) macro that will end up in a buffer containing the
content to be watched.

Typically, wwwdiff without arguments is called by cron(8) on a
regular basis, say once a week. As it is intended to be run
non-interactively, it inserts breaks of random length before
each individual fetch to avoid hammering the servers being
monitored.

.SH ENVIRONMENT VARIABLES

\fBWWWDIFF_DIRECTORY\fR
.br
This variable contains the location of the mirror directory; see
wwwdiff-files(5) for a description of what the directory has
to look like. If this environment variable is not specified, the
default "~/.wwwdiff" is taken.

\fBWWWDIFF_INITIALWAIT\fR
.br
The time to wait after being called before starting the actual
mirroring. When wwwdiff is called without arguments, a random time
period between 0 and WWWDIFF_INITIALWAIT seconds is waited for in
order to scatter start times. The default value is 3600.

\fBWWWDIFF_WAIT\fR
.br
The time to wait before fetching the next site.
A random time period between 0 and WWWDIFF_WAIT seconds is waited for.
The default value is 600.


.SH FILES
\fB~/.wwwdiff/\fR
.br
The directory where the local copies are stored. See wwwdiff-files(5) for
details.

.SH SEE ALSO
.BR wwwdiff-files "(5), "cron "(8), "crontab "(1), "wget "(1), "mailx "(1)"

.SH AUTHOR
Klaus Aehlig <aehlig@linta.de>


And, of course, there is a man page describing the on-disk data structure.
.TH WWWDIFF-FILES 5 "May 4, 2008" "" ""
.SH NAME
wwwdiff-files \- specification of the file structure in a wwwdiff directory
(typically ~/.wwwdiff)

.SH DESCRIPTION
At the designated place (default ~/.wwwdiff) there are several directories,
each named after the quoted (see below) URL it mirrors. Inside each of these
directories, the files have the following meaning.

\fBget\fR
.br
Any form of script that, when executed by /bin/sh, fetches said
URL into the file given as argument to the script. In this way, a
canonisation of the content can be carried out.

\fBdata\fR
.br
The contents of this URL, as read at the last fetch.

\fBnotify\fR
.br
Where to send the diff(1) output if the content of the URL turns out
to have changed. This file is a script that will be
executed by /bin/sh and should expect the diff on stdin.

\fBdatanew\fR
.br
A temporary file that is filled with the new data while fetching is in
progress.

\fBdataold\fR
.br
A copy of the page as it looked before the last update.

\fB#lock\fR
.br
A lock directory.

.SH QUOTING
A simple quoting mechanism is used. Every character other than a letter (a-zA-Z),
a digit (0-9), or a hyphen is replaced by an underscore, followed by the decimal
code of that character, followed by another underscore.
This way of quoting is a one-to-one (injective) function. Note that, in
particular, "_" is quoted as "_95_", and "http://example.org" becomes
"http_58__47__47_example_46_org".

.SH LOCKS
The existence of a directory #lock in ~/.wwwdiff/<quoted_name> signifies
that this directory is locked. To obtain
a lock, try to create the directory by an atomic operation; if
this fails, wait and try again later.
To release a lock remove the directory #lock.

.SH TO UPDATE
To update the mirrored page, first obtain the lock. Then fetch the
contents of the URL to the file datanew. Do the comparison and, if
necessary, send the notification. Then rename data to
dataold. Afterwards rename datanew to data. Finally release the lock.


.SH SEE ALSO
.BR wwwdiff "(1), "sh "(1), "diff "(1)"

.SH AUTHOR
Klaus Aehlig <aehlig@linta.de>