Whisper and WhisperX for podcast transcription

I just realized that I hadn’t posted this. Several months ago, I read about whisper.cpp and started playing with it. To quote from their docs, whisper.cpp is a

High-performance inference of OpenAI’s Whisper automatic speech recognition (ASR) model:

  • Plain C/C++ implementation without dependencies
  • Apple Silicon first-class citizen – optimized via ARM NEON, Accelerate framework, Metal and Core ML
https://github.com/ggerganov/whisper.cpp

In other words, a fast and free speech transcription app that runs on your laptop. Damn!

In fact, it’s so efficiently written that you can run transcription on an iPhone, or even in the browser. I haven’t tried those yet.

Anyway, that gave me an idea: a couple of friends of mine run a podcast called TGN. They’ve been at it for a few years, and have around 250 episodes of an hour each. Could I use whisper.cpp to produce a complete set of episode transcripts? If I did, would that be useful?

(Insert a few months of side project hacking, nights and weekends.)

It works, and pretty well. For podcasts, however, you end up with a wall of text, because Whisper doesn’t do what’s called ‘speaker diarization’: identifying which speaker said what. It’s on their roadmap, though.
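In case it’s useful, my per-episode transcription step boils down to something like this sketch. It’s not the project’s actual code: the binary location, model path, and file names are placeholder assumptions, and whisper.cpp’s flags may vary by version. whisper.cpp expects 16 kHz mono WAV input, hence the ffmpeg conversion first:

import subprocess
from pathlib import Path

def transcribe(mp3_path: Path, model: str = 'models/ggml-base.en.bin') -> Path:
    # whisper.cpp wants 16 kHz mono 16-bit PCM WAV, so convert the MP3 first.
    wav_path = mp3_path.with_suffix('.wav')
    subprocess.run(
        ['ffmpeg', '-y', '-i', str(mp3_path),
         '-ar', '16000', '-ac', '1', '-c:a', 'pcm_s16le', str(wav_path)],
        check=True)
    # -otxt writes a plain-text transcript next to the input file.
    subprocess.run(
        ['./main', '-m', model, '-f', str(wav_path), '-otxt'],
        check=True)
    return Path(str(wav_path) + '.txt')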

I was sharing progress on the TGN Slack when an employee of OctoML DM’d me. They have a WhisperX image that does diarization, and he offered to help me use it for the project.
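If you’d rather run WhisperX yourself instead of using a hosted image, it also has a Python API. This sketch is adapted from the WhisperX README rather than from my project, so treat the model name, batch size, and Hugging Face token as assumptions; the API may have shifted between versions:

import whisperx

device = 'cuda'  # or 'cpu'
audio = whisperx.load_audio('episode.mp3')

# Transcribe, align words to timestamps, then attach speaker labels.
model = whisperx.load_model('large-v2', device, compute_type='float16')
result = model.transcribe(audio, batch_size=16)

align_model, metadata = whisperx.load_align_model(
    language_code=result['language'], device=device)
result = whisperx.align(result['segments'], align_model, metadata, audio, device)

# Diarization uses pyannote models and needs a Hugging Face token.
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token='YOUR_HF_TOKEN', device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

Each segment in the result then carries a speaker label like SPEAKER_00, which is exactly what a two-host podcast transcript needs.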

(More nights and weekends. Me finding bugs for OctoML. Adding a second podcast. Getting help from a couple of friends, including the a-ha mkdocs idea from David.)

Voila! May I present:

The key useful bits include:

  • Full-text search
  • Local mirrored copies of the podcast MP3 and raw transcript text, plus (a work in progress) a mirror of the episode web page
  • Using mkdocs to build a static site that looks OK and renders the Markdown into HTML; there’s a minimal config sketch below
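The mkdocs side needs surprisingly little configuration. Here’s a hypothetical minimal mkdocs.yml, not the project’s real one; the site name and theme are placeholders, and the built-in search plugin is what provides the full-text search mentioned above:

# Hypothetical minimal mkdocs.yml; the real config lives in the repo.
site_name: Podcast Transcripts
theme: readthedocs
plugins:
  - search  # built-in client-side full-text search

Run mkdocs build and out comes a static site you can host anywhere.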

Lots of features are yet to build, but it’s been a really fun side project. The source code is all on GitHub; right now I’m working in the prefect branch, trying out a workflow rewrite using Prefect.
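For a flavor of that rewrite, a Prefect 2.x flow looks something like the sketch below. The task and flow names here are made up for illustration; the real pipeline is in the prefect branch:

from prefect import flow, task

# Hypothetical task and flow names, just to show the shape of the rewrite.
@task(retries=2)
def fetch_episode(mp3_url: str) -> str:
    ...  # download the MP3, return the local path

@task
def transcribe_episode(mp3_path: str) -> str:
    ...  # run the transcription step, return the transcript path

@flow
def update_podcast(episode_urls: list[str]) -> None:
    for url in episode_urls:
        transcribe_episode(fetch_episode(url))

Prefect then handles retries, logging, and scheduling.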

Internet manners – download if newer

If you’re a developer, sooner or later you’ll need to poll a website or URL looking for information. There are good and bad uses of this (looking at you, Ticketmaster and scalpers), and I want to be good. (This was inspired by reading Rachel’s post about her rate limiter.)

As part of my podcast speech-to-text project, I download a chunk of XML from each show’s RSS feed. These feeds are commercially hosted and can handle lots of traffic, but all the same, why not be smart and polite about it? My goal is a cron job that polls for new episodes, and I don’t want to cost the podcasts excess hosting and traffic fees.

The HTTP spec includes a header called Last-Modified, so your first thought might be:

  • Use the HTTP ‘HEAD’ verb to fetch only the metadata
  • Check the Last-Modified header to see if it’s newer

This didn’t work for me: the ‘HEAD’ responses for these feeds don’t include Last-Modified. Instead, you need the ‘ETag’ header! It’s roughly (see that link for details) a hash of the file, so if the RSS feed updates, the ETag should change as well. Simply save the ETag to disk and compare it against the version returned in the ‘HEAD’ metadata. Fast, polite, simple. Here’s the Python version from the project:

import logging
import requests

log = logging.getLogger(__name__)

def podcast_updated(podcast: Podcast) -> bool:
    # Based on our saved ETag, are there new episodes? If not, don't hammer
    # their server. Internet manners. Method: call HEAD instead of GET.
    # HEAD on these feeds doesn't include a usable timestamp, but does include
    # the cache ETag, so we snapshot the ETag to disk and see if it differs.
    filename = podcast.name + '-timestamp.txt'
    r = requests.head(podcast.rss_url, timeout=30)
    url_etag = r.headers['ETag']  # raises KeyError if the feed omits ETag
    try:
        with open(filename) as f:
            file_etag = f.read()
        if file_etag == url_etag:
            log.info(f'No new episodes found in podcast {podcast.name}')
            return False
    except FileNotFoundError:
        log.warning(f'File {filename} not found, creating.')

    with open(filename, 'w') as f:
        f.write(url_etag)
    return True

It does add a local file per URL, which is a bit messy, and I need to rework the code a bit, but it works and runs quite fast.

Another idea would be to try the requests-cache module, which does this plus local caching of the content. I’ve read their docs but haven’t tried it yet.
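From their docs, the basic usage would look something like this; since I haven’t tried it, treat it as a sketch, and note that the cache name, expiry, and feed URL are placeholders:

from requests_cache import CachedSession

# Responses land in a local SQLite cache and get reused until they expire;
# requests-cache can also do conditional requests using the server's ETag.
session = CachedSession('podcast_cache', expire_after=3600)

r = session.get('https://example.com/feed.rss')
print(r.from_cache)  # True when served from the local cache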

P.S. Yes, I also shared this on StackOverflow. 😉

PyCon 2018

This is, I think, my fourth PyCon? San Jose, Montreal, and now Cleveland. Always a great conference, with lots of enthusiastic and friendly people. The talks were a bit less expert than I’d like, but it is 60% first-timers now. The city was good: lots of nice buildings downtown, and green compared to home.

Public art is always nice.

I did the PyLadies auction again; thanks, Greg, for taking me in Montreal! Social anxiety is a problem, and friendship helps.

Overall good. Lots of companies looking for Python developers.