LLMs can solve hard problems

We’re a couple of years into the LLM era now, and the Gartner hype cycle from last year seems relevant:

Image credit: Gartner (https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2023-gartner-hype-cycle)

The purpose of this post is to share two hard problems (in the CS sense of the term) that a friend and I solved with an LLM.

The problem and impetus

I’m friends with a couple of guys who have a podcast. (I know, right?) It’s been going for seven years or more now, it’s information-packed, and more than once I’ve wanted to search for something previously mentioned. Then, via Simon Willison I think, I learned about the amazing Whisper.cpp project, which can run OpenAI’s Whisper speech-to-text model at high speed on desktop hardware. As others have said, “speed is a feature,” and being able to process an hour of audio in a few minutes on a MacBook Pro or Mac mini made the project interesting and feasible.

The project and goals

The overall goal of the project was to generate transcripts of every episode, index them with a local search engine, and serve the results as a static website. Open source code, no plan to monetize or commercialize, purely for the fun of it.

The code uses Python for logic, Makefiles for orchestration, wget for web downloads, mkdocs for website generation, xmltodict for the RSS parsing, Tenacity for LLM retries and rsync to deploy code. Nothing too exciting so far.
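
For a flavor of the plumbing, here is a rough sketch of the RSS-parsing step. It assumes the feed has already been downloaded with wget and follows the standard RSS layout; the file name is illustrative, not the project’s actual code.

import xmltodict

# Parse a podcast RSS feed that wget already downloaded (file name is illustrative).
with open("feed.xml", "rb") as f:
    feed = xmltodict.parse(f)

# Standard RSS layout: channel -> items, each with an enclosure pointing at the MP3.
for item in feed["rss"]["channel"]["item"]:
    title = item["title"]
    mp3_url = item["enclosure"]["@url"]  # xmltodict exposes XML attributes with an @ prefix
    print(title, mp3_url)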

This let us generate a decent website. However, it quickly became obvious that Whisper alone would not suffice, since it doesn’t indicate who is speaking, a feature known as ‘diarization.’ After I shared the work on the TGN Slack, a member offered his employer’s product as a potential improvement: WhisperX, hosted on OctoAI, which includes diarization. So instead of this wall of text

you get something more like this (shown here, lightly processed)

Enter the LLM

So now we have roughly an hour’s worth of audio, as JSON, with speaker labels. But the labels are not the names of the speakers; ‘SPEAKER_00’ isn’t helpful. We need the names.

We need to somehow process 50-100KB of text, with all of the peculiarities of English, and extract from it the names of the speakers. Normally it’s the same two guys, but sometimes they do call-in type episodes with as many as 20 callers.

This is super hard to do with conventional programming. I tried some crude “look for their usual intros” logic, but it only worked maybe half of the time, and I didn’t want to deep-dive into NLP and parsing. At my day job I was working on LLM-related things, so it made sense to try one, but our podcasts were too large for the ChatGPT models available. Then came Claude, with its 200k-token context window, and we could send an entire episode in a single go.

The code simply asks Claude to figure out who’s speaking. Here is the prompt:

The following is a public podcast transcript. Please write a two paragraph synopsis in a <synopsis> tag
and a JSON dictionary mapping speakers to their labels inside an <attribution> tag.
For example, {"SPEAKER_00": "Jason Heaton", "SPEAKER_01": "James"}.
If you can't determine speaker, put "Unknown".
If for any reason an answer risks reproduction of copyrighted material, explain why.
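
In Python, the attribution step looks roughly like this. This is a sketch rather than the project’s actual code: the model name, function name, and tag parsing are illustrative, and it extracts only the attribution dictionary for brevity.

import json
import re
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def attribute_speakers(transcript: str) -> dict:
    """Ask Claude for a synopsis plus a SPEAKER_xx -> name mapping; return the mapping."""
    prompt = (
        "The following is a public podcast transcript. Please write a two paragraph "
        "synopsis in a <synopsis> tag and a JSON dictionary mapping speakers to their "
        "labels inside an <attribution> tag.\n\n" + transcript
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # the mid-tier model mentioned in the notes below
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    text = message.content[0].text
    # Pull the JSON dictionary out of the <attribution> tag; empty dict if it's missing.
    match = re.search(r"<attribution>(.*?)</attribution>", text, re.DOTALL)
    return json.loads(match.group(1)) if match else {}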

We get back the JSON dictionary, and the Python code uses it to build correct web pages. That works! Since we were already paying maybe a nickel per episode, we also ask for a synopsis, another super-hard programming task that LLMs do easily. The results look like this:

Note the synopsis and the speaker labels.

LLMs are not just hype

It works great! We now have a website, with a search index, that is decent looking and a usable reference. There’s more to do and work continues, but I’m super pleased, and still impressed by how easily an LLM handled two otherwise intractable problems. It’s not all hype; there are real, useful things you can do, and I encourage you to experiment.

Lastly, please check out the website and enjoy; it’s a complete time capsule of the podcast. I wonder if Archive.org needs a tool like this?

Notes and caveats

  • This uses Claude 3.5 ‘Sonnet’, their mid-tier model. The smaller model didn’t do as well, and the top-tier model cost roughly 10x as much for similar results.
  • An hour of podcast audio converted to text is about 45k to 150k tokens. No problem at all for our 200k limit.
  • Anthropic has a billing system that appears designed to defend against stolen credit cards and drive-by fraud. I had to pay a reasonable $40 and wait a week before I could do more than a few requests, and even at the $40 level I hit the 2.5M-token limit in under a hundred episodes. For the two podcasts, it took about a week to process the ~600 episodes.
  • About one episode in ten gets flagged as a copyright violation, which it is not. Super weird. Even weirder is that making the same call again with no delay usually fixes the error; a Tenacity one-liner suffices (see the sketch after this list). As you can see, we tried to solve this in the prompt, but it seems to make no difference.
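
For the curious, the retry wrapper is roughly this; a sketch that assumes the attribute_speakers function from the earlier sketch and treats an empty result as a failure worth retrying.

from tenacity import retry, stop_after_attempt, wait_fixed

# Re-asking the same question usually clears the spurious copyright refusal.
@retry(stop=stop_after_attempt(3), wait=wait_fixed(2))
def attribute_with_retry(transcript: str) -> dict:
    result = attribute_speakers(transcript)
    if not result:  # no <attribution> tag came back, likely a refusal
        raise ValueError("no attribution returned")
    return result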

Whisper and WhisperX for podcast transcription

I just realized that I hadn’t posted this. Several months ago, I read about whisper.cpp and started playing with it. To quote from their docs, whisper.cpp is a

High-performance inference of OpenAI’s Whisper automatic speech recognition (ASR) model:

  • Plain C/C++ implementation without dependencies
  • Apple Silicon first-class citizen – optimized via ARM NEON, Accelerate framework, Metal and Core ML
https://github.com/ggerganov/whisper.cpp

In other words, a fast and free speech transcription app that runs on your laptop. Damn!
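
To give a sense of the workflow, here is a rough sketch of the transcription step: convert the episode MP3 to the 16 kHz mono WAV that whisper.cpp expects, then run its command-line tool. Paths, file names, and the model choice are illustrative, not the project’s actual code.

import subprocess

# whisper.cpp wants 16 kHz mono WAV input, so convert the episode MP3 first.
subprocess.run(
    ["ffmpeg", "-i", "episode.mp3", "-ar", "16000", "-ac", "1", "episode.wav"],
    check=True,
)

# Run the whisper.cpp CLI against a downloaded ggml model; -otxt writes episode.wav.txt.
subprocess.run(
    ["./main", "-m", "models/ggml-base.en.bin", "-f", "episode.wav", "-otxt"],
    check=True,
)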

In fact, it’s so efficiently written that you can transcribe on your iPhone, or even in the browser. I haven’t tried those yet.

Anyway, that gave me an idea: a couple of friends of mine run a podcast called TGN. They’ve been at it for a few years, and have around 250 episodes of an hour each. Could I use whisper.cpp to produce a complete set of episode transcripts? If I did, would that be useful?

(Insert a few months of side project hacking, nights and weekends.)

It works, pretty well. For podcasts, however, you end up with a wall of text, because Whisper doesn’t do what’s called ‘speaker diarization,’ that is, identifying which speaker is talking when. It’s on their roadmap, though.

I was sharing the progress on the TGN Slack when an employee of the company OctoML DM’d me. They have a WhisperX image that does diarization, and he offered to help me use it for the project.

(More nights and weekends. Me finding bugs for OctoML. Adding a second podcast. Getting help from a couple of friends, including the a-ha mkdocs idea from David.)

Voila! May I present:

The key useful bits include

  • Full text searching
  • Local mirrored copies of the podcast MP3s and raw transcript text, plus in-progress attempts to mirror each episode’s web page.
  • Using mkdocs to build a static website that looks OK and renders Markdown into HTML.

Lots of features are yet to build, but it’s been a really fun side project. Source code is all on GitHub; right now I’m working in the prefect branch, trying out a workflow rewrite using Prefect.
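
The Prefect rewrite is still taking shape, but the rough structure looks like this; task names and bodies are placeholders, not the actual code.

from prefect import flow, task

@task
def download_episode(url: str) -> str:
    # wget the MP3 here; return the local path
    return "episode.mp3"

@task
def transcribe(mp3_path: str) -> str:
    # run whisper.cpp / WhisperX here; return the transcript path
    return "episode.json"

@task
def attribute(transcript_path: str) -> dict:
    # ask Claude for the speaker mapping and synopsis
    return {}

@flow
def process_episode(url: str) -> dict:
    mp3 = download_episode(url)
    transcript = transcribe(mp3)
    return attribute(transcript)

if __name__ == "__main__":
    process_episode("https://example.com/episode.mp3")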