プログラマ道 (puroguramā-dō)

Wed, 06 May 2026 22:30:00 +0200

プログラマ道¹.

Companion piece to the 100x post. That one was the framing argument: AI tooling is +0.99 with a wielding bonus, not a multiplier. This is the ground-level version. How I actually drive the thing day to day, what I will not delegate, and why none of this is new.

I drive Claude every day. Across Woosmap services in Python, a music app in Go and SwiftUI on the weekends (Tunes), a SNES toolchain in Python (a816, xdds), and a fair amount of 65c816 ROM hacking. The tool is real. It does not flip the table on the discipline an engineer needs. It raises the floor a notch and rewards the wielder. What the wielder is actually doing, when it works, is the part nobody writes down.

What blindness looks like

Drop Claude into a fresh codebase and watch it grep. It reads filenames, opens files, follows imports by string match, guesses at module boundaries. On a small repo this is fine. On anything real it degrades into expensive guessing. Python projects make it worse: the source of truth for a symbol’s type is often the installed package, not the repo, and the model cannot see installed packages. So it produces plausible code against an imagined API.
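A small sketch of that gap, using only the standard library. The running interpreter can say where a symbol really lives and what it really accepts, which a grep over the checkout cannot. `json.dumps` stands in here for any installed dependency; the `describe` helper is a name made up for illustration.

```python
import inspect
import json


def describe(symbol):
    """Return where a callable is defined and the parameter names it accepts."""
    # Path on disk of the defining module, usually outside the repo.
    # May be None if the source file is not available in this install.
    source = inspect.getsourcefile(symbol)
    params = list(inspect.signature(symbol).parameters)
    return source, params


source, params = describe(json.dumps)
print(source)  # a path inside the interpreter's install, not your checkout
print(params)  # the real parameter names, no guessing
```

A model that can only read the repo has to infer both answers from call sites. Anything with access to the environment can simply look them up.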

The fix is not a better prompt. The fix is to give the model the same thing a human gets: a language server. Once basedpyright is wired up and Claude can ask “what is this symbol, where is it defined, what type does it return”, the questions get sharper and the answers stop being invented. The model does not need to be smarter. It needs to stop being blind.

Our internal stack runs in containers, no local virtualenv, dependencies live inside the image. A model on the host sees nothing. The fix we use internally is a sidecar container that exposes a language server with access to the real installed packages, attached to the application container. Once that is in place, Claude stops hallucinating signatures. Same model, same prompts, dramatically less drift. The intelligence was never the bottleneck. The view was.
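For a sense of what those questions look like on the wire, here is a minimal sketch of two Language Server Protocol requests. The method names and the Content-Length framing come from the LSP specification. The file URI and cursor position are made up, and nothing here describes how the sidecar or Claude is actually wired. A real session also needs the `initialize` handshake before the server will answer.

```python
import json


def lsp_frame(method, params, request_id):
    """Encode one JSON-RPC request using LSP's Content-Length framing."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    ).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def position_params(uri, line, character):
    """Build position params: zero-based line and character offsets."""
    return {
        "textDocument": {"uri": uri},
        "position": {"line": line, "character": character},
    }


# Hypothetical file and cursor position, for illustration only.
where = position_params("file:///app/service/handlers.py", 41, 17)

# "What is this symbol, what type does it return?"
hover = lsp_frame("textDocument/hover", where, 1)
# "Where is it defined?"
definition = lsp_frame("textDocument/definition", where, 2)
```

The point of running the server next to the application container is that the answers to these requests are computed against the packages that are really installed.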

The testbed problem, named on a SNES

ROM hacking is notoriously trial-and-error². The CPU does not care about your intent. The PPU cares even less. A bug in HDMA timing is invisible until you stare at the right cycle on the right scanline, and the only way to know you fixed it is to watch the framebuffer change. There is no type system saving you. There is no test framework that ships in the box.

So I built one. kintsuki (yes, typo of kintsugi, name stuck) embeds ares (a SNES emulator) as a C library, exposes Python bindings on top, and gives me programmatic control of the emulator: step execution, trace, read memory, write memory, hook the per-frame interrupt, dump CPU and video memory, diff framebuffers. From there I write pytest cases that drive the ROM to a known state, assert on the bytes that should have changed, and fail loudly when they did not. Regression testing for ROM hacks. With kintsuki in the loop, Claude can iterate. It edits the asm, pytest runs the ROM, the snapshot comes back, the test says green or red.
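The shape of such a test, as a sketch only. kintsuki's real API is not shown in this post, so `FakeEmulator`, its methods, the address and the "ROM" below are all stand-ins invented for illustration. Only the pattern is the point: drive to a known state, assert on the bytes.

```python
class FakeEmulator:
    """Stand-in for an embedded emulator. Not kintsuki's real API."""

    def __init__(self, rom):
        self.rom = rom
        self.wram = bytearray(0x2000)
        self.frame = 0

    def run_frames(self, count):
        # A real emulator would execute the ROM. This stub calls the
        # "ROM" once per frame so the shape of the test stays visible.
        for _ in range(count):
            self.rom(self)
            self.frame += 1

    def read(self, address, length):
        return bytes(self.wram[address:address + length])


INVENTORY_SLOT = 0x0100  # hypothetical address


def patched_rom(emu):
    """Pretend hack: from frame 3 on, the inventory slot holds item 0x2A."""
    if emu.frame >= 3:
        emu.wram[INVENTORY_SLOT] = 0x2A


def test_inventory_slot_is_written():
    emu = FakeEmulator(patched_rom)
    emu.run_frames(2)
    assert emu.read(INVENTORY_SLOT, 1) == b"\x00"  # not written yet
    emu.run_frames(5)
    assert emu.read(INVENTORY_SLOT, 1) == b"\x2a"  # the byte that should change
```

Under pytest, a test like this goes red the moment an edit to the asm stops writing that byte, which is the signal the model needs to iterate.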

This is the pattern in general. When the model gets stuck in a loop, almost always it is not a reasoning failure. It is a feedback failure. Build the testbed before you blame the prompt. A pytest case that reproduces the bug. A shell one-liner that exercises the endpoint. A snapshot test that goes red on the broken behavior. Whatever shape it takes in your domain. Once it exists, the model converges. Same iteration loop a competent human uses. Without it, you get hallucinated success.

For the SNES work I started with Mesen Lua scripts, which is the standard answer in the romhacking community. Useful for one-off probing, painful for regression testing. The Lua ran inside the emulator, the test harness lived outside it, and the seam between them was where bugs hid. kintsuki replaced that whole arrangement with the emulator itself as a library called from pytest. One process, one language, one trace of execution. The Mesen scripts taught me what to want. Building kintsuki was admitting that the standard tool had hit its ceiling.

TDD plays well with the model for the same reason. The red test is the bar. “Done” is defined externally instead of by whatever sounds done.

Time to serve

The number I find useful for measuring whether the tool is moving anything is time to serve³. From the moment a problem is articulated to the moment the fix is in production, end to end. Not lines of code, not commits per day, not tokens consumed.

It captures the whole pipeline in one number: the parts the model speeds up (typing, boilerplate, first cuts) and the parts it does not (deciding what to build, the data model, the testbed, the diff review, CI, production). If the tool is genuinely net positive, the number drops. If the model is producing more code while the rest of the pipeline absorbs the cost, the number stays flat. The metric does not flatter the tool. It measures the wielder using the tool.
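A minimal sketch of the measurement, assuming each work item records two timestamps: when the problem was articulated and when the fix reached production. The timestamps below are made up, and using the median rather than the mean is a choice made for this example, not something the metric prescribes.

```python
from datetime import datetime
from statistics import median


def time_to_serve_hours(items):
    """Median hours from 'problem articulated' to 'fix in production'."""
    durations = [
        (shipped - articulated).total_seconds() / 3600
        for articulated, shipped in items
    ]
    return median(durations)


# Made-up timestamps, for illustration only.
work_items = [
    (datetime(2026, 4, 1, 9, 0), datetime(2026, 4, 2, 9, 0)),   # 24 h
    (datetime(2026, 4, 3, 9, 0), datetime(2026, 4, 3, 15, 0)),  # 6 h
    (datetime(2026, 4, 6, 9, 0), datetime(2026, 4, 8, 9, 0)),   # 48 h
]

print(time_to_serve_hours(work_items))  # 24.0
```

Tracked over time, this is the number that should drop if the tooling helps end to end, and stay flat if review, CI or deployment absorb whatever the model saved.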

Worth pairing with the token economics piece from the public post. Time to serve is the team metric today. Cost per shipped feature is the same shape once tokens stop being subsidized.

What I do not delegate

This is the 道 part⁴. The places where I refuse to hand the wheel over, regardless of how confident the model sounds.

Architecture choices the model makes silently. I have shipped diffs where the model swapped a lifecycle (background task vs request-bound, lazy vs eager init, sync vs async at a boundary) without flagging the consequence. The change was not wrong-looking. It was just a different decision than the one I asked for, embedded in code that read as ordinary. That kind of thing slips past self-review. Human PR review by someone outside the loop with the model is still load-bearing.
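A made-up illustration of the lazy versus eager case, since the actual diffs are not shown here. Both classes read as ordinary code and expose the same `conn` attribute. What differs is when the expensive call happens, and therefore when a failure would surface. All names are hypothetical.

```python
from functools import cached_property

connections_opened = []


def open_connection(name):
    """Stand-in for an expensive resource, e.g. a database connection."""
    connections_opened.append(name)
    return f"conn:{name}"


class EagerService:
    def __init__(self):
        # Connects at construction: a bad config fails at startup.
        self.conn = open_connection("eager")


class LazyService:
    @cached_property
    def conn(self):
        # Connects on first access: a bad config fails on the first request.
        return open_connection("lazy")


eager = EagerService()
lazy = LazyService()
print(connections_opened)  # ['eager']

lazy.conn
print(connections_opened)  # ['eager', 'lazy']
```

Swapping one for the other changes a handful of lines, passes the same tests, and moves the failure from deploy time to request time. That is a decision, and it deserves a line in the PR description.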

Platform debugging. The general-purpose memory showed the rolling inventory animating correctly while the video memory stayed wrong: the data was right, the upload to the screen was not. The hypothesis (the per-scanline DMA was pointing at the wrong background layer during the only window each frame when the picture chip lets you write to it) was not in any prompt. The model coded the patch once I gave it the algorithm. It could not have formed the algorithm from a screenshot.

Data model and naming. The model defaults to plausible names that collide in app code, or to over-typing every parameter, or to stringly-typing everything. The boundary calls are mine. So is deciding when a function should not exist at all.

The model writes code. Choosing what should exist, what it is called, what invariants it preserves, that is still the job.

Getting oversmarted

The way the model wins against me, when it does, is not by being cleverer. It is by me losing track of what got built. Five tool calls deep, a refactor I did not quite ask for slipped in alongside the fix I did ask for, tests pass, diff looks reasonable. Sign off and the surprise lands a week later.

The countermove is the same rule I use on myself. Faut faire qui marche avant que c’est beau: make it work before you make it pretty. Works for the model too. Let it go from A to B with the ugly first cut. Don’t pre-optimize the route, don’t stop it for a paint job halfway through. Then read the diff with eyes on. If along the way I see something starting to jeopardize the destination, that is the signal: what I asked for was not clear enough. The drift is feedback on my prompt, not on the model.

The dog metaphor is the cleanest. Go fetch. The dog will fetch. If you said go fetch and pointed vaguely, the dog comes back with a stick when you wanted the ball. Not the dog’s fault. Throw better, or accept whatever comes back.

Discipline is two things. Keep the destination visible to yourself the whole time, so you notice when the path bends. And reread the diff. The model is fast enough that the only person who can lose track of what got built is you.

Ask Claude to review Claude

I regularly ask the model to look at its own work with fresh eyes. It catches dead branches, redundant guards, mocks that should be fixtures, signatures that drifted from the call sites.

No costume preamble, no “you are now a senior reviewer” framing. I am not a jeu de rôle grandeur nature (live-action role-play) guy. Played RPGs as video games, plenty, but I do not run my prompts like sessions at a tabletop. Why would I. In the same week I am a programmer, a CTO, a romhacker, a JS/TS dev, a reviewer, an ops guy. The hat changes with the task, no costume needed. Claude is the same. Ask it to look again, it does.

The act of generating and the act of judging are not the same operation, and forcing the second pass cheaply pulls out the obvious mistakes before they reach a human reviewer. Not a substitute for that reviewer. A filter that respects their time.

Same workflow as with a coworker

Working with the model is not that different from working with a teammate. I look for friction, usually starting with my own pain, and I build something to smooth it. My job has been building tools for developers for a long time, and it turns out Claude hits the same walls a human dev hits: bad imports, missing types, no testbed, no probe into the running system. Solve those for the human and the model gets the fix for free.

The LSP sidecar from earlier is the cleanest example. Built it because I was losing time to the model not seeing installed packages, but the same sidecar makes any human on the team more productive in the same repos. kintsuki is the same shape: deterministic snapshots so I could reason about HDMA, and the model uses them too.

You cannot build high on crappy foundations. That has been true with humans for decades and the model does not change it. If anything it makes the foundations more visible, because the model is brutal at exposing the soft spots in your dev loop.

Hard position to defend in rooms where AI is supposed to change the world. I see it as a tool. If typing on a keyboard were the programmer’s job we would already have been replaced by typists, who can produce far more words per minute than most of us. We were not. The keyboard was never the bottleneck. It is not the bottleneck now either.

The 道 has been written down for thirty years

The instincts in this post are not new. Two books in particular keep mapping onto what I do with the model.

Growing Object-Oriented Software, Guided by Tests⁵ is the testbed argument with the receipts. Freeman and Pryce wrote it for human teams trying to keep design honest as a system grows. Same instincts hold when the author is a model: start from a failing test, let the test shape the interface, refactor under green. Tests as the design surface, not just the safety net.

The Pragmatic Programmer⁶ is the other. Hunt and Thomas have a section called Programming by Coincidence that names the failure mode I push back on hardest with the model: code that works without the author understanding why. Their tracer bullets idea is the same instinct as faut faire qui marche avant que c’est beau: fire something end to end first, see where it lands, adjust aim.

Both books predate the model by a decade or more and apply to it without modification. The 道 was written down a long time ago. The model is the newest student in the dojo, not the founder of a new school. If a tool reframes the discipline so completely that the canon stops applying, the canon was wrong. So far, the canon is fine.

What I tell people who ask

When the model and I disagree, I am right more often than not, and the times I am wrong it is because I delegated something I should have owned. That is the calibration. Not trust the model, not distrust the model. Trust your own ability to recognize when the output is wrong, and treat the model’s confidence as decoration.

プログラマ道¹ is the same job it has always been. The tools changed. The discipline did not.

¹ Programmer plus 道, the way or path. Same suffix as 柔道 (jūdō, the gentle way), 剣道 (kendō, the way of the sword), 茶道 (sadō, the way of tea). Practice you keep refining, not a credential you finish.
² SNES vocabulary used in this section, glossed once: CPU is the 65c816 main processor. PPU is the picture processing unit, a fixed-function chip that composes the picture from sprites and tile layers (not a GPU in the modern sense, no shaders, no general compute). WRAM is general-purpose memory the CPU writes to. VRAM is the PPU’s separate memory, only writable through narrow windows each frame. HDMA is a per-scanline DMA mechanism that lets you change PPU registers during display. NMI is the non-maskable interrupt that fires once per frame, the standard window for VRAM updates. BG3 is one of the four background layers the PPU composes. The point of the article does not depend on the details, but the jargon is real and so are the bugs it produces.
³ Not sure who coined “time to serve”. Closest canonical sibling is DORA’s lead time for changes (Forsgren, Humble, Kim, Accelerate). I have been using time to serve internally without a clean attribution. If you know the source, tell me and I will update this footnote.
⁴ 道 on its own. The way. What stays yours after the tooling moves under your feet.
⁵ Steve Freeman and Nat Pryce, Growing Object-Oriented Software, Guided by Tests, Addison-Wesley 2009. The “guided by tests” half is the part that ages best. Tests as the design surface, not just the safety net.
⁶ Andy Hunt and Dave Thomas, The Pragmatic Programmer, Addison-Wesley 1999, 20th anniversary edition 2019. Programming by Coincidence and Tracer Bullets are the two sections that map cleanest onto working with an LLM.
