We’re a couple of years into the LLM era now, and the Gartner hype cycle from last year seems relevant:

The purpose of this post is to share two hard problems (in the CS sense of the term) that a friend and I solved with an LLM.
The problem and impetus
I’m friends with a couple of guys who have a podcast. (I know, right?) It’s been going for seven years or more now, it’s information-packed, and more than once I’ve wanted to be able to search for something previously mentioned. Then, via Simon Willison I think, I learned about the amazing Whisper.cpp project, which can run OpenAI’s Whisper speech-to-text model at high speed on desktop hardware. As others have said, “speed is a feature,” and being able to do an hour of audio in a few minutes on a MacBook Pro or Mac Mini made the project interesting and feasible.
The project and goals
The overall goal of the project was to generate transcripts of every episode, index them with a local search engine, and serve the results as a static website. Open source code, no plan to monetize or commercialize, purely for the fun of it.
The code uses Python for logic, Makefiles for orchestration, wget for web downloads, mkdocs for website generation, xmltodict for RSS parsing, Tenacity for LLM retries, and rsync for deployment. Nothing too exciting so far.
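For flavor, here’s a minimal sketch of the feed-parsing step, assuming the RSS file has already been fetched (the project uses wget for that) and a standard feed layout; the file name and field access are illustrative, not the project’s actual code.

```python
import xmltodict

# Parse a podcast RSS feed that wget has already downloaded.
with open("feed.xml") as f:
    feed = xmltodict.parse(f.read())

# Standard RSS layout: rss -> channel -> item, one <item> per episode.
# xmltodict returns a list when an element repeats, as <item> does here.
for item in feed["rss"]["channel"]["item"]:
    title = item["title"]
    audio_url = item["enclosure"]["@url"]  # the MP3 URL lives on the enclosure tag
    print(title, audio_url)
```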
This let us generate a decent website. However, it was quickly obvious that Whisper alone would not suffice, since it doesn’t indicate who is speaking, a feature known as ‘diarization.’ After I shared the work on the TGN Slack, a member offered his employer’s product as a potential improvement: WhisperX, hosted on OctoAI, which includes diarization. So instead of this wall of text

you get something more like this (shown processed a bit)

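The “processed a bit” step is mostly collapsing consecutive segments from the same speaker into paragraphs. Here’s a hedged sketch, assuming WhisperX-style output with a `segments` list whose entries carry `speaker` and `text` fields (field names may differ from what OctoAI actually returns):

```python
import json

def group_by_speaker(path: str) -> list[tuple[str, str]]:
    """Collapse consecutive same-speaker segments into (speaker, paragraph) pairs."""
    with open(path) as f:
        segments = json.load(f)["segments"]

    grouped: list[tuple[str, str]] = []
    for seg in segments:
        speaker = seg.get("speaker", "Unknown")
        text = seg["text"].strip()
        if grouped and grouped[-1][0] == speaker:
            # Same speaker as the previous segment: extend the paragraph.
            grouped[-1] = (speaker, grouped[-1][1] + " " + text)
        else:
            grouped.append((speaker, text))
    return grouped
```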
Enter the LLM
So now we have roughly an hour’s worth of audio, transcribed to JSON, with labels indicating the speaker. But the labels are not the names of the speakers; ‘SPEAKER_00’ isn’t helpful. We need the names.
We need to somehow process 50-100KB of text, with all of the peculiarities of English, and extract from it the names of the speakers. Normally it’s the same two guys, but sometimes they do call-in type episodes with as many as 20 callers.
This is super hard to do with conventional programming. I tried some crude “look for their usual intros” logic, but it only worked maybe half of the time, and I didn’t want to deep-dive into NLP and parsing. At my day job I was working on LLM-related things, so it made sense to try one, but our podcasts were too large for the ChatGPT models available at the time. Then came Claude, with its 200k-token context window, and we could send an entire episode in a single go.
The code simply asks Claude to figure out who’s speaking. Here is the prompt:
The following is a public podcast transcript. Please write a two paragraph synopsis in a <synopsis> tag
and a JSON dictionary mapping speakers to their labels inside an <attribution> tag.
For example, {"SPEAKER_00": "Jason Heaton", "SPEAKER_01": "James"}.
If you can't determine speaker, put "Unknown".
If for any reason an answer risks reproduction of copyrighted material, explain why.
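For the curious, the round trip looks roughly like this with Anthropic’s Python SDK, pulling the two tagged sections out of the reply with regexes. The model string, token limit, and parsing details are assumptions for illustration, not necessarily the project’s exact code:

```python
import json
import re

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def attribute_speakers(prompt: str, transcript: str) -> tuple[str, dict]:
    """Send the prompt plus transcript to Claude; return (synopsis, speaker map)."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model string
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt + "\n\n" + transcript}],
    )
    text = message.content[0].text

    # Extract the <synopsis> and <attribution> sections from the reply.
    synopsis = re.search(r"<synopsis>(.*?)</synopsis>", text, re.DOTALL).group(1).strip()
    attribution = json.loads(
        re.search(r"<attribution>(.*?)</attribution>", text, re.DOTALL).group(1)
    )
    return synopsis, attribution
```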
We get back the JSON dictionary, and the Python code uses it to build correct web pages. That works! Since we were paying maybe a nickel per episode, we also ask for a synopsis, another super-hard programming task that LLMs handle easily. The results look like this:

Note the synopsis and the speaker labels.
LLMs are not just hype
It works great! We now have a decent-looking website, with a search index, that makes a usable reference. There’s more to do and work continues, but I’m super pleased, and still impressed at how an LLM made two intractable problems easy. It’s not all hype; there are real, useful things you can do, and I encourage you to experiment.
Lastly, please check out the website and enjoy: a complete time capsule of the podcast. I wonder if Archive.org needs a tool like this?
Links
- The resulting websites (I added a second podcast mid-project) are at https://tgn.phfactor.net/ and https://wcl.phfactor.net/
- Source code is at https://github.com/phubbard/tgn-whisperer. If you peruse the commit history, you’ll see that the older code uses local Whisper.cpp and later switches to WhisperX.
- Big thanks to PyCharm and JetBrains for the FOSS license for the project. PyCharm is my favorite IDE; the Pro version has niceties like a usable CSV plugin, and I added a GitHub Copilot subscription as well.
- Big thanks to OctoAI for the donated GPU hours while we worked out the bugs with their WhisperX hosting. (To be clear: once those were fixed, I became a regular paying customer.)
Notes and caveats
- This uses Claude 3.5 ‘Sonnet’, their mid-tier model. The smaller model didn’t do as well, and the top-tier model cost roughly 10x as much for similar results.
- An hour of podcast audio converted to text is about 45k to 150k tokens, no problem at all for the 200k-token limit.
- Anthropic has a billing system that appears designed to defend against stolen credit cards and drive-by fraud: I had to pay a reasonable $40 and wait a week before I could make more than a few requests. For two podcasts, it took about a week to process ~600 episodes; even at the $40 tier, I hit the 2.5M-token limit in under a hundred episodes.
- About one episode in ten gets flagged as a copyright violation, which it is not. Super weird. Even weirder, making the same call again with no delay usually fixes the error, so a Tenacity one-liner suffices (see the sketch below). As you can see from the prompt, we tried to solve this there as well, but it seems to make no difference.
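For reference, the retry wrapper really is about one line, give or take a decorator. A sketch using Tenacity, reusing the attribute_speakers helper from the earlier sketch; the attempt count and wait are assumptions:

```python
from tenacity import retry, stop_after_attempt, wait_fixed

@retry(stop=stop_after_attempt(3), wait=wait_fixed(1))
def attribute_with_retry(prompt: str, transcript: str):
    # If Claude refuses with a copyright complaint, the tag extraction in
    # attribute_speakers raises, and Tenacity re-issues the identical call,
    # which usually succeeds on the next try.
    return attribute_speakers(prompt, transcript)
```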
