Lessons learned

A few lessons from building v0.1 → v0.3 worth surfacing for anyone working on similar tools.

Process

Manual test-drives surface UX bugs that automated tests can't

Every milestone has shipped with 99%+ test coverage and still had bugs the test suite couldn't see. In v0.2, a manual Fireworks run surfaced a skill-invocation gap: the model tried to call skills as tools, so we made that actually work. In v0.3, a real session surfaced a Ctrl+C abort that didn't actually abort (broken three layers deep), a stdin-echo glitch during streaming, missing Markdown rendering, read-panel verbosity, and an editor-exit auto-submit. All of these were visible within 30 seconds of a real session, and none were caught by 549 passing tests.

Make the manual test-drive a milestone gate, not an afterthought.

Write a real consumer to validate the design doc

v0.3's extension-model design doc was 218 lines of "here's the extension surface." We could have called it documentation and stopped. Instead we wrote letscode-memory as an external-shaped plugin against the proposed stable surface. The exercise surfaced two architectural gaps (composition + namespace) that pure inspection wouldn't have found.

For any architectural design doc, write a non-trivial real-world consumer against it before declaring it done.

Defer with rationale, not silence

From v0.2 onward, every deferred item has a target version, an unblock condition, and a reason. Items deferred without rationale tend to come back as "wait, why did we drop that?" months later. Rationale is cheap insurance.
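
One way to make "no silent deferrals" mechanical is to record each deferred item as a small structure that refuses blank fields. The sketch below is illustrative only: `DeferredItem` and its field names are hypothetical and not taken from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeferredItem:
    """One deferred piece of work (hypothetical structure, for illustration)."""

    title: str
    target_version: str     # the version the item is deferred to
    unblock_condition: str  # what has to be true before work can resume
    reason: str             # why it was deferred in the first place

    def __post_init__(self) -> None:
        # Reject blank fields so an item cannot be deferred without rationale.
        for name in ("title", "target_version", "unblock_condition", "reason"):
            if not getattr(self, name).strip():
                raise ValueError(f"deferred item is missing {name!r}")
```

Constructing an item with an empty `reason` raises `ValueError`, which turns the "why did we drop that?" question into a failure at write time rather than months later.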

Technical

Multi-commit fixes are sometimes the truth

The v0.3 Ctrl+C-abort fix took three commits because the cancel path had three independent failure modes: (1) cancel-event-not-checked-while-blocked, (2) signal-handler-clobbered-by-prompt_toolkit, (3) post-cancel-exceptions-bubble-up. Each commit fixed one; the first two were useless without the third.

Squashing them into one "fix Ctrl+C" commit would have hidden the debugging story. The three-commit history is the truth of what was wrong.
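
The three failure modes can be illustrated with a stdlib-only sketch. This is not the project's actual fix: `Cancelled`, `wait_cancellable`, and `run_turn` are hypothetical names, and re-installing the SIGINT handler for each turn is only one assumed way of coping with another library replacing it.

```python
import signal
import threading


class Cancelled(Exception):
    """Raised on the work path once a cancel has been requested."""


def wait_cancellable(done: threading.Event, cancel: threading.Event,
                     poll_s: float = 0.05) -> None:
    # (1) Check the cancel event while blocked: wait in short slices
    # instead of one unbounded wait that never looks at `cancel`.
    while not done.wait(poll_s):
        if cancel.is_set():
            raise Cancelled()


def run_turn(work, cancel: threading.Event) -> str:
    # (2) Install our own SIGINT handler for this turn, in case something
    # else replaced it. signal.signal() must run in the main thread.
    previous = signal.signal(signal.SIGINT, lambda signum, frame: cancel.set())
    try:
        work(cancel)
        return "completed"
    except Cancelled:
        # (3) Treat the post-cancel exception as a normal outcome
        # instead of letting it bubble up as a crash.
        return "cancelled"
    finally:
        # Put back whatever handler was there before (None means it was
        # not installed from Python, so fall back to the default).
        signal.signal(signal.SIGINT,
                      previous if previous is not None else signal.SIG_DFL)
```

Removing any one of the three numbered pieces leaves Ctrl+C broken in a different way, which is why a fix of this shape naturally splits into separate commits.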

Buffer-and-render beats streaming when output needs formatting

Streaming character-by-character is a vibe; rendered Markdown is utility. v0.3 chose to render Markdown at MessageEnd instead of streaming per character. For Fireworks-class latency the pause is short enough. For genuinely slow providers, concurrent prompt-and-stream via prompt_toolkit.patch_stdout shipped in v0.4. Its three test-drive rounds taught a sharper lesson: a test-drive can reject the feature design, not just the code (the advisory-queue steering model was scrapped in favor of interrupt-and-redirect).

When two desirable properties conflict, pick the one with the better fallback path.
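
A minimal sketch of why buffering helps: formatting markers can be split across streamed chunks, so a renderer that only sees one chunk at a time never matches them. `render_on_end` and `toy_render` are hypothetical stand-ins (the toy renderer just upper-cases `**bold**` spans); a real Markdown renderer would take their place.

```python
import re
from typing import Callable, Iterable


def render_on_end(chunks: Iterable[str], render: Callable[[str], str]) -> str:
    """Collect streamed text and render it once, after the last chunk."""
    buffer: list[str] = []
    for chunk in chunks:
        buffer.append(chunk)        # nothing is displayed while streaming
    return render("".join(buffer))  # one formatting pass over the full message


def toy_render(text: str) -> str:
    # Stand-in for a Markdown renderer: upper-case anything inside **...**.
    return re.sub(r"\*\*(.+?)\*\*", lambda m: m.group(1).upper(), text)


chunks = ["**bo", "ld** text"]                      # bold marker split across chunks
per_chunk = "".join(toy_render(c) for c in chunks)  # "**bold** text": markers never match
buffered = render_on_end(chunks, toy_render)        # "BOLD text"
```

The cost is the pause before anything appears, which is the trade-off described above.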

PyPI namespace conflicts are release-blocking

Any release that intends to publish to PyPI is blocked until the namespace conflict is resolved: rename the project, ask the current owner about a transfer, or self-host. Check PyPI before committing to a project name.
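
The check itself can be scripted against PyPI's JSON endpoint, which answers 200 for a published project and 404 otherwise. The sketch below is a rough pre-check, not a guarantee: `pypi_status` and `name_is_registered` are hypothetical helper names, the `opener` parameter exists only so the logic can be exercised without network access, and a 404 does not prove that PyPI will accept the name.

```python
import urllib.error
import urllib.request


def pypi_status(name: str, opener=urllib.request.urlopen) -> int:
    """Return the HTTP status of PyPI's JSON endpoint for `name`."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with opener(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code


def name_is_registered(name: str, opener=urllib.request.urlopen) -> bool:
    # 200: a project is already published under this name.
    # Anything else (typically 404): no published project was found.
    return pypi_status(name, opener) == 200
```

Running `name_is_registered("your-project-name")` before the first commit is a cheap way to find out early that a rename is coming.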

Architecture

Functional core / imperative shell pays off twice

Side effects at the edges (LLM I/O, file I/O, subprocess); pure-ish loop in the middle. This shape made every refactor cheap: the two-tier message model in v0.2, the ExecutionEnv abstraction, and the per-CustomMessage-kind render hook in v0.3 all landed without rewriting the loop's contract.
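
A minimal sketch of the shape, under stated assumptions: `decide`, `run`, `CallModel`, `Finish`, and the `FINAL:` convention are all hypothetical and much simpler than the project's real loop. The point is only that the core returns a description of what to do next, and the shell is the sole place where I/O happens.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class CallModel:
    """Effect request: ask the shell to send these messages to the LLM."""
    messages: tuple[str, ...]


@dataclass(frozen=True)
class Finish:
    """Terminal decision: the loop is done and this is the answer."""
    answer: str


def decide(history: tuple[str, ...]) -> Union[CallModel, Finish]:
    # Functional core: inspects state, returns a decision, performs no I/O.
    last = history[-1]
    if last.startswith("FINAL:"):
        return Finish(last.removeprefix("FINAL:").strip())
    return CallModel(history)


def run(prompt: str, call_model: Callable[[tuple[str, ...]], str],
        max_turns: int = 8) -> str:
    # Imperative shell: the only place that touches the outside world.
    history: tuple[str, ...] = (prompt,)
    for _ in range(max_turns):
        action = decide(history)
        if isinstance(action, Finish):
            return action.answer
        history += (call_model(action.messages),)
    raise RuntimeError("no final answer within max_turns")
```

Because `decide` is pure, it can be tested with plain tuples, and `run` can be driven by a fake `call_model` instead of a network call.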

Designing for hypothetical plugins inflates surface area

v0.1 shipped 9 hookspecs + 4 registries + lifecycle hooks. Only the built-in plugin used any of them. v0.2 added one more hook. v0.3 finally landed a real third-party plugin (letscode-memory) — at which point three of the seven lifecycle hooks had real users.

The lesson cuts both ways: pre-building plugin seams without consumers feels wasteful in the short term, but having them there made writing letscode-memory a one-evening exercise instead of a "first refactor the core, then write the plugin" multi-week project. Verdict: leave the seams alone, but track which ones have real users and which don't.
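
Tracking which seams have real users can be as simple as recording who registers against each hook. The sketch below is a toy registry, not the project's plugin machinery: `HookRegistry`, the hook names in the usage note, and the `"builtin"` plugin label are all hypothetical.

```python
class HookRegistry:
    """Toy registry that remembers which plugin implements which hook."""

    def __init__(self, hook_names):
        self._impls = {name: [] for name in hook_names}

    def register(self, plugin_name, hook_name, func):
        if hook_name not in self._impls:
            raise KeyError(f"unknown hook: {hook_name}")
        self._impls[hook_name].append((plugin_name, func))

    def external_users(self, builtin=frozenset({"builtin"})):
        """Map each hook to the non-built-in plugins that implement it."""
        return {
            hook: sorted({plugin for plugin, _ in impls if plugin not in builtin})
            for hook, impls in self._impls.items()
        }

    def seams_without_users(self, builtin=frozenset({"builtin"})):
        """List hooks that only built-in code (or nothing) implements."""
        report = self.external_users(builtin)
        return sorted(hook for hook, users in report.items() if not users)
```

With hooks `on_start`, `on_message_end`, and `on_shutdown`, and an external plugin registered only on `on_message_end`, `seams_without_users()` reports the other two. That list is a candidate for review, not an automatic deletion list.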