Internet manners – download if newer

If you’re a developer, sooner or later you’ll need to poll a website or URL, looking for information. There are good and bad uses of this (looking at you, Ticketmaster and scalpers), and I want to be good. (This was inspired by reading Rachel’s post about her rate limiter.)

As part of my podcast speech-to-text project I am downloading a chunk of XML from their RSS feeds. These are commercial and can handle lots of traffic, but just the same why not be smart and polite about it? My goal is to have a cron job polling for new episodes, and I don’t want to cost them excess hosting and traffic fees.

The HTTP spec includes a header called Last-Modified, so your first thought might be:

  • Use the HTTP ‘HEAD’ verb to only get metadata
  • Check the Last-Modified header to see if it’s newer

This didn’t work for me: the ‘HEAD’ responses for these feeds don’t include Last-Modified. Instead, you can use the ‘ETag’ header! It’s roughly (see that link for details) a file hash, so if the RSS feed updates, the ETag should change as well. Simply save the ETag to disk and compare it against the version returned in the ‘HEAD’ metadata. Fast, polite, simple. Here’s the Python version from the project:

# requests, log and Podcast come from the surrounding project
def podcast_updated(podcast: Podcast) -> bool:
    # Based on our saved ETag, are there new episodes? If not, don't
    # hammer their server. Internet manners. Method: call HEAD instead of GET.
    # Note that HEAD doesn't give us a usable timestamp, but does include the
    # cache ETag, so we simply snapshot the ETag to disk and see if it differs.
    filename = podcast.name + '-timestamp.txt'
    r = requests.head(podcast.rss_url)
    url_etag = r.headers['ETag']
    try:
        with open(filename, 'r') as f:
            file_etag = f.read()

        if file_etag == url_etag:
            log.info(f'No new episodes found in podcast {podcast.name}')
            return False
    except FileNotFoundError:
        log.warning(f'File {filename} not found, creating.')

    with open(filename, 'w') as f:
        f.write(url_etag)
    return True

It adds a local file per URL, which is a bit messy and something I’d like to rework, but it’s working and runs quite fast.

Another idea would be to try the requests-cache module, which does this plus local caching of content. I’ve read their docs but not tried it yet.

P.S. Yes, I also shared this on StackOverflow. 😉
