June 25, 2025

Hey all! Pierce back in your inbox for another roundup of last week’s musings.

I was genuinely surprised by the replies to my last newsletter. I shout into the void so much on X that I sometimes forget there are a bunch of people listening. Plus, it’s hard to get buried by the latest meme when this shows up alongside your boss’s invite to a 1:1. It’s about the furthest thing possible from algorithmic propagation in there.

Thanks to everyone who shot some thoughts back. Keep the hot takes coming.

How text diffusion works [link]

Normal generative transformer architectures (GPT, Claude) are autoregressive in both their training and their inference. They proceed token by token, left to right, appending the most likely next token to the sequence each time. By definition, every new token is conditioned on everything that came before it. You write a paragraph word by word - one at a time - and so should a language model.
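
For the curious, that loop is only a few lines of code. Here's a simplified greedy-decoding sketch, assuming a Hugging Face-style causal LM interface (real samplers add temperature, top-p, KV caching, and so on):

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=50):
    # Greedy autoregressive decoding: every new token is conditioned
    # on the full prefix generated so far.
    tokens = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        logits = model(tokens).logits           # (1, seq_len, vocab_size)
        next_token = logits[:, -1].argmax(-1)   # pick the most likely next token
        tokens = torch.cat([tokens, next_token[:, None]], dim=1)
    return tokenizer.decode(tokens[0])
```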

But what if it didn't have to be one at a time?

Gemini Diffusion was one of the cooler things to come out of I/O. Google managed to train a state-of-the-art text model that uses diffusion, which up until ~4 months ago was only being taken seriously as an image generation technique. And even there it seemed to be somewhat falling out of favor relative to straight vision transformers.

Alongside a text writeup of how text diffusion models work, I’m trying out a new visual style for interactive walkthroughs of LLM internals. This one has three retro gaming consoles to showcase (1) how autoregressive models do inference, (2) how diffusion text models are pretrained, and (3) how diffusion models work during inference. I’m pretty bullish on a distill.pub-like visual framework for helping to explain academic concepts. Made all the more bullish by vibe coding lowering the barrier to entry for academics who didn’t graduate with a degree in HCI.
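
If you want the gist of the inference side before clicking through, here's my rough mental model as a sketch. This is a simplified reading of masked text diffusion in general, not Gemini Diffusion's actual algorithm; the mask_id token and the confidence-based unmasking schedule are assumptions:

```python
import torch

def diffusion_decode(model, seq_len, mask_id, num_steps=8):
    # Start from a fully masked sequence and fill it in over a fixed
    # number of denoising steps, most-confident positions first.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = model(tokens).logits                   # predict every position in parallel
        confidence, preds = logits.softmax(-1).max(-1)  # best guess + confidence per slot
        # Unmask a fraction of the remaining masked slots each step.
        k = max(1, int(still_masked.sum().item()) // (num_steps - step))
        confidence = confidence.masked_fill(~still_masked, -1.0)
        top = confidence.topk(k, dim=-1).indices
        tokens[0, top[0]] = preds[0, top[0]]
    return tokens
```

The key contrast with the autoregressive loop above: every step predicts the whole sequence at once, so the number of model calls is fixed by the step schedule rather than by the output length.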

What I've been surprised to experience is the tenacity of modern LLMs. Give them a scoped task with some provable success criteria (i.e. unit tests) and these models will keep iterating until they find a solution, even if that means burning through a ton of context or reasoning tokens to get there.
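
The whole harness for that kind of loop is tiny. Something like this toy version, where llm_complete and apply_patch are hypothetical stand-ins for your model call and patch-application logic:

```python
import subprocess

def solve_with_retries(task_prompt, llm_complete, apply_patch, max_attempts=10):
    # Keep asking the model to fix its own work until the test suite passes
    # (the provable success criteria) or we run out of attempts.
    feedback = ""
    for attempt in range(max_attempts):
        patch = llm_complete(task_prompt + feedback)
        apply_patch(patch)  # write the proposed change to disk
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return patch
        feedback = f"\n\nAttempt {attempt + 1} failed these tests:\n{result.stdout[-2000:]}"
    return None
```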

Like clockwork whenever a new frontier model is released, I hear the same cycle of optimism and then doubt:

Day 1: Holy shit - did you see Openthropic just dropped their new model??
Day 2: Guys, you really have to try this - it’s unbelievably better than the last model I was just using
Day 10: Wait, guys, did someone just nerf the model accuracy? It seems way worse than it was just a few days ago

So much of our current impression of models is sentiment based. I usually try to be more of a numbers guy. But I’ve been having a lingering feeling that the numbers on reasoning models aren’t capturing some of their intrinsic qualities when placed in an iterative loop. In this post I spent some time exploring the general idea of tenacity in language models and arguing that they’re really a cut above what the metrics say.

(I have it on good authority that the frontier labs very rarely switch to quantization/compression after they release a model. The nerfing seems more psychological than technical.)

Misadventures in Python hot reloading [link]

Every major Node web framework these days bundles a dev server, and that dev server responds within milliseconds to changes in your code. Make a change to your flexbox and it's ready by the time you switch to your browser. Python lacks similar support out of the box.

Why are they so different?

They say that necessity is the mother of invention. I certainly find that’s the case in open source. Unclear how true that is at a generic SaaS startup.

When my Mountaineer-powered webapp started feeling slow during development code changes, I spent a lot of time benchmarking our logic. Of all things, it was package dependency imports that were taking the longest. uvicorn only supports process-based reloading, so every time you change a file, a supervisor has to fully tear down your Python interpreter, boot up a new one, and re-import all of your files. The more files you have to load, the longer that’s going to take.
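
If you want to see the import tax on your own project, the stdlib makes it easy to measure (python -X importtime gives a more detailed breakdown; the packages below are just placeholder examples):

```python
import time

def timed_import(name):
    # Time how long a cold import of a package takes.
    start = time.perf_counter()
    module = __import__(name)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
    return module

# Swap in whichever heavy dependencies your app actually pulls in.
for pkg in ["numpy", "sqlalchemy", "pydantic"]:
    timed_import(pkg)
```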

I spent a couple days back in March hammering out a new hot reloading library with focused support for Python’s internals. I figured this was as good a time as any to break down the internals of the importlib runtime and the motivations for the approach that Firehot took.
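
As a teaser for the post: the core stdlib primitive is importlib.reload, which re-executes a module in place instead of restarting the interpreter. A bare-bones sketch of the idea (not Firehot's actual implementation; myapp.handlers is a made-up module):

```python
import importlib
import os
import time

import myapp.handlers as handlers  # hypothetical module we want to keep fresh

def watch_and_reload(poll_interval=0.5):
    # Naive in-process reload: re-execute the changed module instead of
    # restarting the interpreter, so heavy dependencies stay imported.
    last_mtime = os.path.getmtime(handlers.__file__)
    while True:
        mtime = os.path.getmtime(handlers.__file__)
        if mtime != last_mtime:
            last_mtime = mtime
            importlib.reload(handlers)
            print(f"reloaded {handlers.__name__}")
        time.sleep(poll_interval)
```

The hard part, and what the post digs into, is everything this sketch ignores: modules that import each other, stale references held by long-lived objects, and deciding which parts of the import graph actually need to be re-executed.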

Until next time. May the tensors be good to you.

Pierce