- notes from the bay
July 28, 2025
If this whole Hollywood thing doesn’t work out for the next generation of cinematographers, there seems to be plenty of work in San Francisco. It feels like every day there’s a new viral video drop for a startup’s fundraising or some aesthetic day in the life of a founder. I know exactly how my B-Roll would look: getting up, making coffee, and walking to my office (next to my bedroom), then eventually getting dinner. Just add some sepia and call it done.
A few random things in case you missed them:
Apparently Astronomer does Apache software in addition to concert indiscretions? Their marketing org had a genius response to the scandal. Maybe "no bad ideas in a brainstorm" sometimes works out…
Tony couldn’t let the expansion of his North Beach rival Joe go unanswered. Mr. Tony himself just opened up a new place in SFO. All things equal, I’d take a new spot in the Marina over Gate 4, but I’ll take the carbs where I can get ’em.
Am I the only one that thinks Claude Plays Pokemon is less impressive than convincing a random collection of internet users to win the game? Judging by how we typically interact online one is a small miracle. The other is just RL.
Now on to the nerdy stuff.
How the KV Cache works [link]

We have some rough intuition that ChatGPT needs to wait for a bit (sometimes even 10s+) before generating your first token of output. After all, if you sent your prompt to a friend, they would need a second to read it too. But obviously transformers aren’t actually reading in the conventional sense. We can blame the attention heads for this: there’s huge computation required to parse your input before we can even get to the output. Without inference methods like the KV Cache, you’d be waiting those 10s for every single token that you get as output. And we think latency right now is bad.
Now for some information retrieval inspiration. When we're searching for data in a search engine, we input some query text and it delivers some results. Internally, most smart search engines don't just search for exact text: they chunk your text into phrases, strip suffixes, weight terms by their rarity in the corpus, and so on.
We can apply this same construct in attention, but instead of actually searching by words or implementing lemma chunking, we can just let all of these concepts be vectors. For each token going through an attention layer, you'll have a QKV tuple of values.
It’s not usually described with parallels to caching & information theory, so I tried to write something that explains a bit more of my intuition about it. I always find it neat when basic computer science constructs make their way into vectorized models.
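To make the caching intuition concrete, here's a toy single-head attention decoder in plain Python (dimensions, weights, and tokens are all made up). Without a cache, every decode step re-projects the entire prefix into keys and values; with a cache, each token's key/value is computed exactly once and appended. Both paths produce identical outputs.

```python
import math
import random

random.seed(0)
d = 4  # toy head dimension

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

def project(M, x):
    # Project row vector x through weight matrix M (i.e. x @ M).
    return [sum(x[i] * M[i][j] for i in range(d)) for j in range(d)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()

def attend(q, K, V):
    # Scaled dot-product attention: one query against all keys/values so far.
    scores = [sum(ki * qi for ki, qi in zip(k, q)) / math.sqrt(d) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(w)
    w = [x / z for x in w]
    return [sum(w[t] * V[t][j] for t in range(len(V))) for j in range(d)]

tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]

def decode_no_cache(tokens):
    # Recompute K and V for the whole prefix at every single step.
    outs = []
    for t in range(1, len(tokens) + 1):
        K = [project(Wk, x) for x in tokens[:t]]
        V = [project(Wv, x) for x in tokens[:t]]
        outs.append(attend(project(Wq, tokens[t - 1]), K, V))
    return outs

def decode_with_cache(tokens):
    # Project each token's key/value once, then just append to the cache.
    K, V, outs = [], [], []
    for x in tokens:
        K.append(project(Wk, x))
        V.append(project(Wv, x))
        outs.append(attend(project(Wq, x), K, V))
    return outs
```

The uncached loop does O(t) projections per step (quadratic overall), while the cached loop does one; that gap is exactly what you'd otherwise pay on every output token.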
Building a (kind of) invisible mac app [link]
It turns out that Cluely doesn't really work. Or at least, it doesn't work all of the time. The core takeaway: macOS only gives us the power to hide windows from legacy window-capture APIs like CGWindowListCreateImage(). Modern screen recorders, including Zoom, Tuple, QuickTime, and anything using ScreenCaptureKit, capture the final composited display output and are unaffected by these techniques.
Cluely is this ultra-viral app ($15M Series A from a16z) that seems more marketing than substance. One of the key claims on their website and documentation is “being invisible to screen share.” After all, what’s the benefit of a cheating app your interviewers can see you using? I reproduced the methods they’re using in Swift so we could take a look at when they succeed and when they fail. The TL;DR: it only works on Chrome, because Google relies on outdated APIs for now; all other screen-share apps have already upgraded to the modern stack. If you share with Zoom, it will fully show your Cluely overlay. False advertising if you ask me.
Long term, this is still a cat-and-mouse game. You see it with meeting software that transparently records notes in violation of two-party consent laws. You see it in Cluely. I’m generally still a believer that if you control the end device, you eventually have a leg up against the companies trying to create safeguards remotely. But at least for now, Apple seems to be thwarting Cluely’s plan.
httpx is the right way to do web requests in Python [link]
As even more webapps become shims on top of API services, and those services have latency measured in seconds or minutes, the ergonomics of network requests become pretty important. The gold standard here is probably fetch() in JavaScript: it's simple, powerful, and baked into most browsers. It's an industry standard for a reason.
Even though aiohttp still wins on raw throughput, my default choice has become httpx. You can nearly always find it in my pyproject.tomls. Here's why.
They say fame invites criticism. I feel like it was only last week that Tea hit first place on the App Store. (Oh, because it was.) For those who aren’t in the dating scene, Tea promised free background checks on men in a community verified to be women-only. This identity verification was the crux of their pitch: to sign up, you had to upload a photo of yourself alongside your driver’s license. Just as fast, a big leak dropped onto 4chan with a zip file of 72k profile photos and government IDs. There was no encryption, no deletion after processing, not even a private bucket. They even included the location data of their original users. Yikes.
Just as interestingly, the data leaker included the script they used to do the download. And wouldn’t you know, it used requests of all things, looping synchronously through a bucket to pull all the images. Clearly the leaker didn’t read my post on httpx. Or maybe my case just wasn’t strong enough. I’m not sure; maybe give it a read and let me know why hackers don’t like async?
Until next time. May the tensors be good to you.
Pierce