Random bits of free software

PyThumbnail

PyThumbnail is a Python script that uses Gecko (Firefox's rendering engine) to generate a thumbnail of a web page. It is headless: you can call it from the command line or from another program, and it doesn't require a running X server. It is based on Ross Burton's screenshot-tng.py, which itself derives from previous work by Matt Biddulph et al.

My main additions are the automatic launching of a VNC server, for headless / batch usage, and a guardian process to kill the main process after a timeout, so that the script won't wait forever in the case of network problems or other nuisances.

I have also written a helper script to call many instances of PyThumbnail in parallel, reusing the VNC servers.

VNC

PyThumbnail launches its own VNC server when the DISPLAY environment variable is missing. This is a great way to generate a thumbnail from cron scripts or other non-interactive facilities. VNC server creation and destruction are guarded by a lockfile, to prevent concurrent calls to the system command vncserver from two instances of PyThumbnail. This is needed because vncserver does not lock its own files and directories and will fail miserably if called more than once at the same time.
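The locking idea can be sketched as follows. This is a minimal illustration and not the code from pythumbnail.py: the lock file path, the function names and the use of fcntl.flock are assumptions, and it is written for a current Python rather than the 2008 interpreter the script targets.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager

# Hypothetical lock file location; the real script may use another path.
LOCK_PATH = os.path.join(os.path.expanduser("~"), ".pythumbnail-vnc.lock")


@contextmanager
def vnc_lock(path=LOCK_PATH):
    """Hold an exclusive lock on `path` for the duration of the block."""
    with open(path, "w") as lock_file:
        # Blocks until every other holder has released the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def run_vncserver(args, lock_path=LOCK_PATH, runner=subprocess.call):
    """Run the vncserver command with `args` while holding the lock.

    `runner` is injectable so the function can be tried without a VNC install.
    """
    with vnc_lock(lock_path):
        return runner(["vncserver"] + list(args))
```

With this in place, run_vncserver([":7"]) and run_vncserver(["-kill", ":7"]) from two concurrent processes would be serialized instead of racing on vncserver's files.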

Guardian process

PyThumbnail launches a guardian process before creating the GTK window. This process will sleep for a given time, waiting to be killed by the main process as soon as the thumbnail has been generated. If the guardian returns from the sleep before the main process has completed its business, it means that something is taking more time than allowed: an error message is printed on stderr and everything is killed.
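A guardian of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the script's actual code: the function names, the choice of signals and the error message are all hypothetical.

```python
import os
import signal
import sys
import time


def start_guardian(timeout):
    """Fork a guardian that terminates this process after `timeout` seconds.

    Returns the guardian's pid to the calling (main) process.
    """
    main_pid = os.getpid()
    pid = os.fork()
    if pid == 0:
        # Guardian (child): sleep, and if still alive afterwards, the main
        # process has taken too long.
        time.sleep(timeout)
        sys.stderr.write("pythumbnail: timed out after %s seconds\n" % timeout)
        sys.stderr.flush()
        os.kill(main_pid, signal.SIGTERM)  # assumed signal; SIGKILL also works
        os._exit(1)
    return pid


def stop_guardian(pid):
    """Kill the guardian once the thumbnail is done; True if it was killed."""
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)  # reap the child to avoid a zombie
    return os.WIFSIGNALED(status)
```

The main process would call start_guardian() before creating the GTK window and stop_guardian() right after writing the thumbnail.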

Usage

python pythumbnail.py
Usage: pythumbnail.py [-w WIDTH] [-h HEIGHT] [-o OUTPUT_FILE] URL
Will launch its own VNC server if DISPLAY environment variable is missing
Will write to standard output if the -o option is missing
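Calling the script from another Python program might look like the sketch below. It uses only the options listed in the usage message above; the helper names are hypothetical, and the script is assumed to sit in the current directory.

```python
import subprocess


def thumbnail_command(url, output=None, width=None, height=None,
                      script="pythumbnail.py"):
    """Build the argument list for one pythumbnail.py invocation."""
    cmd = ["python", script]
    if width is not None:
        cmd += ["-w", str(width)]
    if height is not None:
        cmd += ["-h", str(height)]
    if output is not None:
        cmd += ["-o", output]
    cmd.append(url)
    return cmd


def fetch_thumbnail(url, **options):
    """Run the script without -o and return the image bytes from stdout."""
    result = subprocess.run(thumbnail_command(url, **options),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```

For example, thumbnail_command("http://example.com/", output="thumb", width=200) produces the same command line you would type by hand.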

PyThumbnail Launcher

Doing some batch thumbnail generation, I noticed that (unsurprisingly) half of the time was spent creating and destroying VNC servers. So I wrote PyThumbnail Launcher, another Python script that creates a pool of VNC servers and uses them to launch many concurrent PyThumbnails, to work away at a list of URLs.

On my test system, this launcher can generate 100 thumbnails of remote websites in under 1 minute, with just 10 threads.

Usage

python pythumbnail-launcher.py
Usage: pythumbnail-launcher.py [-v] [-d TARGET_DIR] [-n N_THREADS] URLS
URLS can be provided either in the commandline or on standard input

More threads mean more parallelization, but also more resource usage, as each thread starts its own VNC server. The default is 10.
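The pooling approach can be sketched like this: one worker thread per VNC display, all pulling URLs from a shared queue. This is not the launcher's actual code. The function names and the hashed output file name are assumptions, and the displays are assumed to be running already; passing DISPLAY in the environment is what keeps PyThumbnail from starting its own server.

```python
import hashlib
import os
import queue
import subprocess
import threading


def run_pool(urls, displays, render):
    """Process `urls` with one worker thread per entry in `displays`.

    `render(url, display)` is called once per URL; returns {url: result}.
    """
    todo = queue.Queue()
    for url in urls:
        todo.put(url)
    results = {}
    results_lock = threading.Lock()

    def worker(display):
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to do for this worker
            outcome = render(url, display)
            with results_lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker, args=(d,)) for d in displays]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def render_with_pythumbnail(url, display, target_dir="."):
    """Run pythumbnail.py on an existing display; returns its exit code."""
    # Hypothetical naming scheme: MD5 of the URL as the output file name.
    name = hashlib.md5(url.encode("utf-8")).hexdigest()
    env = dict(os.environ, DISPLAY=display)
    cmd = ["python", "pythumbnail.py", "-o", os.path.join(target_dir, name), url]
    return subprocess.call(cmd, env=env)
```

A call such as run_pool(urls, [":1", ":2", ":3"], render_with_pythumbnail) would then work through the list with three servers.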

Download

pythumbnail.py (size: 4.8K; license: BSD-like; last updated on 8 Oct 2008)
pythumbnail-launcher.py (size: 3.1K; license: BSD-like; last updated on 8 Oct 2008)

Requirements

You will need a POSIX system with Python, a VNC server and PyGtkMoz.

Instructions for installing the requirements on a Debian Sid system:

  1. sudo apt-get install vnc4server python-gtk2-dev libxul-dev
  2. Run vncserver in a terminal once, under the user ID you will use to generate the thumbnails (you might want to create a system user just for that purpose). Enter a password when prompted: PyThumbnail never uses it, but vncserver requires one.
  3. Kill the VNC server you just started with vncserver -kill :N, where :N is the display number printed to the screen in the previous step.
  4. Remove ~/.vnc/xstartup for the PyThumbnail user and put a symbolic link to /bin/true in its place: this will strip auxiliary processes such as window managers from that user's VNC servers.
  5. Download and extract pygtkmoz-0.1.tar.gz
  6. Edit setup.py and Makefile, changing every occurrence of mozilla-gtkmozembed into xulrunner-gtkmozembed
  7. make
  8. sudo python setup.py install
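The search-and-replace in step 6 can also be scripted. The sketch below is one way to do it, assuming the extracted pygtkmoz directory contains setup.py and Makefile at its top level; the function names are hypothetical.

```python
import os


def replace_in_file(path, old="mozilla-gtkmozembed", new="xulrunner-gtkmozembed"):
    """Replace every occurrence of `old` with `new` in the file at `path`.

    Returns the number of occurrences replaced.
    """
    with open(path) as handle:
        text = handle.read()
    count = text.count(old)
    if count:
        with open(path, "w") as handle:
            handle.write(text.replace(old, new))
    return count


def patch_pygtkmoz(source_dir):
    """Patch setup.py and Makefile in `source_dir`; returns {name: count}."""
    return {name: replace_in_file(os.path.join(source_dir, name))
            for name in ("setup.py", "Makefile")}
```

Run patch_pygtkmoz() on the extracted directory before steps 7 and 8.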