July 14, 2025

The fog always comes alongside summer in San Francisco. And it’s been back in full force this week, which probably means it’s roasting hot everywhere other than the city. Thanks for keeping us cool, Karl, you’re a real one.

A few random things in case you missed them:

  • Soham Parekh was outed for working ~10 jobs at the same time. I wonder if he is in the running for one of Meta’s $200M pay packages.

  • The Giants seem to be making some M&A deals of their own, not all of which are panning out so far.

  • The SF Illuminati have struck again with Bar Darling. To the naked eye it’s no different from the Squat & Gobble that came before it. But now it’s… a vibe? The difference that a rebrand can make I tell you.

Now on to the nerdy stuff.

How speculative decoding works [link]

Speculative decoding speeds up inference by 2-3x. It's one of those techniques that feels almost like cheating: you get faster inference without sacrificing quality. No quantization tricks or anything. Here's how it works.

Speculative decoding has become something of a standard for hosting LLMs. It’s used by the majority of frontier labs and by serving frameworks like vLLM. Much like my post a few weeks ago on Text Diffusion, I think it’s easier to visualize than to explain. I built a couple of widgets to walk through how speculative decoding works at inference time.
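If you just want the gist before clicking through: a small draft model guesses a few tokens ahead, and the large target model checks all of those guesses in a single pass, keeping the ones it agrees with. Here’s a toy sketch of that draft-and-verify loop. The `draft_dist` and `target_dist` functions are made-up stand-ins for the two models rather than real LLMs, but the accept/reject math is the part that preserves the target model’s output distribution.

```python
import random

VOCAB = list(range(8))  # toy vocabulary of 8 token ids

def toy_dist(context):
    """Deterministic toy distribution over VOCAB for a given context (stand-in for a model)."""
    rng = random.Random(sum(context) + len(context))
    weights = [rng.random() + 0.1 for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def draft_dist(context):
    # Small, cheap "draft" model: a slightly different toy distribution.
    return toy_dist(context + [99])

def target_dist(context):
    # Large "target" model: the distribution the output must match exactly.
    return toy_dist(context)

def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(context, k=4):
    """One round of speculative decoding: draft k tokens cheaply, verify with the target."""
    # 1) The draft model proposes k tokens autoregressively.
    drafted, draft_probs, ctx = [], [], list(context)
    for _ in range(k):
        d = draft_dist(ctx)
        tok = sample(d)
        drafted.append(tok)
        draft_probs.append(d)
        ctx.append(tok)

    # 2) The target model scores every drafted position in one parallel pass
    #    (simulated here with one call per position).
    target_probs = [target_dist(context + drafted[:i]) for i in range(k)]

    # 3) Accept or reject each drafted token, left to right.
    accepted = []
    for i, tok in enumerate(drafted):
        p_t, p_d = target_probs[i][tok], draft_probs[i][tok]
        if random.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p_target - p_draft).
            # This correction keeps the output distributed exactly like the target's.
            residual = [max(0.0, t - d) for t, d in zip(target_probs[i], draft_probs[i])]
            accepted.append(sample(residual))
            return accepted

    # 4) Every draft was accepted: take one bonus token from the target for free.
    accepted.append(sample(target_dist(context + drafted)))
    return accepted

print(speculative_step([3, 1, 4]))
```

The rejection-and-resample step is what makes this lossless: the accepted sequence is distributed exactly as if the target model had sampled every token itself, you just paid for fewer expensive forward passes.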

It’s funny how some architectures become industry standards overnight and others fizzle into the depths of arXiv. I suspect it has something to do with the tradeoffs required. Flash Attention won where many sub-quadratic attention approximations failed. RoPE embeddings won where other positional embeddings were passed over. In most cases researchers prefer solutions where the tradeoffs are well scoped or nonexistent. Anything more opinionated is more of a risk. Speculative decoding is squarely in the zero-risk category.

Under the hood of Claude Code [link]

There's a lot to be learned by looking at LLM prompts from other products. The bigger companies can get more empirical about testing prompts than you can when you're first starting off. It's like an A/B test for model behavior instead of for UI design. They can tweak a prompt, measure the acceptance/rejection rate across all their users, and tweak it further from there.
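Mechanically, that feedback loop doesn’t need to be fancy. Here’s a rough sketch of the idea; the prompt variants, bucketing, and logging are hypothetical, not anything Anthropic has published.

```python
import hashlib
from collections import defaultdict

# Hypothetical prompt variants under test.
PROMPT_VARIANTS = {
    "A": "You are a careful coding assistant. Propose minimal diffs.",
    "B": "You are a careful coding assistant. Explain each change before proposing it.",
}

def assign_variant(user_id: str) -> str:
    """Deterministically bucket each user into one prompt variant."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % len(PROMPT_VARIANTS)
    return sorted(PROMPT_VARIANTS)[bucket]

results = defaultdict(lambda: {"accepted": 0, "total": 0})

def record_suggestion(user_id: str, accepted: bool) -> None:
    """Log whether the user accepted the model's suggestion under their variant."""
    variant = assign_variant(user_id)
    results[variant]["total"] += 1
    results[variant]["accepted"] += int(accepted)

# Simulated events; in production these would come from real usage logs.
for uid, ok in [("u1", True), ("u2", False), ("u3", True), ("u4", True)]:
    record_suggestion(uid, ok)

for variant, r in sorted(results.items()):
    print(variant, r["accepted"] / r["total"])
```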

Judging by the vibes floating around Twitter, people are pretty happy with Claude Code. If you haven’t yet used it yourself, Code is Anthropic’s agentic coding tool that competes with the Cursor Agent, OpenAI Codex, etc. At this point it feels like every startup has an agentic editor. You give it your full feature scope, it plans out the next steps with a todo list, researches your codebase, then drafts a feature. It only stops looping when it’s done.
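The loop itself is conceptually simple, even if the hard parts are the model and the tools. A toy skeleton of the plan-then-act cycle, with canned stand-ins for the planner and tool calls (this is not Claude Code’s actual implementation):

```python
def plan(feature_request: str) -> list[str]:
    """Stand-in for the model drafting a todo list from the feature scope."""
    return ["research: find the auth module", "edit: add a logout endpoint", "edit: write a test"]

def run_tool(task: str, context: list[str]) -> str:
    """Stand-in for tool calls (codebase search, file edits, running tests)."""
    return f"completed '{task}' using {len(context)} prior findings"

def run_agent(feature_request: str, max_steps: int = 20) -> None:
    todo = plan(feature_request)
    context: list[str] = []
    for _ in range(max_steps):
        if not todo:
            break                       # the loop only stops once the todo list is empty
        task = todo.pop(0)              # take the next item off the plan
        context.append(run_tool(task, context))
    print("\n".join(context))

run_agent("add a logout button")
```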

I suspect half of the popularity comes from the predictable pricing. Unlike Cursor’s latest update, which moved most frontier models to per-token billing, Anthropic can subsidize Claude usage since they also own the inference stack. Shows you that one of the benefits of being a frontier lab might be pricing power after all.

Just-in-time compilation in Mountaineer [link]

Mountaineer applies JIT principles at the web framework level instead of the interpreter level. When you tell Mountaineer that a side effect only modifies certain fields, it generates a specialized version of your render function at runtime - one that only computes those specific fields.

The pursuit of computational speed is as old as Alan Turing having to crack the Enigma machine before it rotated its cipher overnight. That work used to be delegated to the compiler level, since basically every language was a compiled language. You could take human readable code and run it through an expensive conversion layer to see if there was anything it could do to speed things up. Slow initial compile, fast perpetual execution. Well worth the tradeoff.

As many companies have moved to languages like Python & JavaScript, this is no longer feasible. These are interpreted languages, meaning they run immediately in an interpreter with no initial compile phase. As such we’re limited in how we can speed them up. Just-in-time compilation (JIT) is one technique we use to make interpreted codepaths close in speed to their fully compiled counterparts. This logic has been in Mountaineer since the beginning but has been relatively under-documented. This post explores how we’re using the same JIT techniques to speed up webapps.
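To make the idea concrete, here’s a rough, framework-agnostic sketch of what “specializing a render function” can look like. The per-field functions and the `specialize_render` helper below are hypothetical, not Mountaineer’s actual API; the point is that a side effect which only touches `todos` never pays for the expensive `analytics` computation.

```python
# Hypothetical per-field compute functions for one page's render payload.
FIELD_FNS = {
    "todos":     lambda db: db["todos"],             # cheap
    "user":      lambda db: db["user"],              # cheap
    "analytics": lambda db: sum(range(10_000_000)),  # expensive, rarely needed
}

_specialized_cache = {}

def specialize_render(fields):
    """JIT-style specialization: build (once) a render that only computes `fields`."""
    key = frozenset(fields)
    if key not in _specialized_cache:
        fns = {name: FIELD_FNS[name] for name in fields}
        _specialized_cache[key] = lambda db: {name: fn(db) for name, fn in fns.items()}
    return _specialized_cache[key]

def full_render(db):
    """Initial page load: compute every field."""
    return {name: fn(db) for name, fn in FIELD_FNS.items()}

db = {"todos": ["ship the newsletter"], "user": "pierce"}
full_render(db)                              # pays for analytics once, on first load
partial = specialize_render({"todos"})(db)   # side effect declared to touch only todos
print(partial)                               # {'todos': ['ship the newsletter']}
```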

Until next time. May the tensors be good to you.

Pierce