Self-play training for augmented chess hit the home cluster's RAM ceiling
Augmented chess forced a self-play pipeline, and the cluster pushed back
Augmented chess re-rolls a small set of rule-changing augments per game: a knight that also moves like a king, a pawn that captures backward once. Stockfish does not know those rules, so the only way to have a real opponent was to train one. Three weeks in April: a five-stage self-play pipeline, a ~24M-parameter network, two OOMs, and the lesson that liveness and readiness must check different things. The engine pre-rolls augments into the event log so finished games replay from seed plus action stream, the property I leaned on in "Three event-sourcing bugs, three pillars of one contract".
TL;DR
- 5 stages, 5 PRs (#350–357), 94 source files, 104 tests, ~24M params.
- First OOM: doc-search-api RAG pipeline leak, peak 237Mi against a 256Mi limit; fix was PR #359 scaling the Bun service to zero in favor of the Go rewrite.
- Second OOM: chess-ai-service idled at 484Mi against a 512Mi limit, OOMKilled 1283 times over 16 days.
- ws-relay-go-stag taught the probe lesson: a single /health wired to both probes turned a slow Mongo upstream into an infinite restart loop.
- Fix shipped both ways: limits 512Mi → 1024Mi, and explicit delete on MCTS roots between games. A higher limit on a leaking process just delays the next page.
Five stages, one pluggable engine interface
The pipeline landed across PRs #350, #351, #352, #356, #357. @cloudnest/chess-ai-core is pure TypeScript with an AlphaZero-style 4708-slot action space (64×73 piece moves + 35 augment selections + resign) and a plugin engine interface so one MCTS loop swaps between Stockfish (UCI binary via Bun.spawn), a random baseline, and the MCTS-Neural engine wrapping the trained ONNX model.
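The 4708-slot arithmetic can be sketched as a flat index layout. Only the counts (64×73 piece moves, 35 augment selections, one resign slot) come from the post; the ordering of the three ranges, the `Action` type, and the function names below are assumptions for illustration, not the actual @cloudnest/chess-ai-core API.

```typescript
// Hypothetical slot layout: piece moves first, then augments, then resign.
const SQUARES = 64;
const MOVE_PLANES = 73; // AlphaZero-style move planes per origin square
const AUGMENT_SLOTS = 35;

const PIECE_MOVE_SLOTS = SQUARES * MOVE_PLANES; // 4672
const AUGMENT_OFFSET = PIECE_MOVE_SLOTS; // augments occupy 4672..4706
const RESIGN_INDEX = AUGMENT_OFFSET + AUGMENT_SLOTS; // 4707
const ACTION_SPACE_SIZE = RESIGN_INDEX + 1; // 4708

type Action =
  | { kind: "move"; fromSquare: number; plane: number }
  | { kind: "augment"; augment: number }
  | { kind: "resign" };

function inRange(value: number, size: number): boolean {
  return Number.isInteger(value) && value >= 0 && value < size;
}

// Map a structured action to its slot in the policy vector.
function encodeAction(action: Action): number {
  switch (action.kind) {
    case "move":
      if (!inRange(action.fromSquare, SQUARES) || !inRange(action.plane, MOVE_PLANES)) {
        throw new RangeError("move out of range");
      }
      return action.fromSquare * MOVE_PLANES + action.plane;
    case "augment":
      if (!inRange(action.augment, AUGMENT_SLOTS)) {
        throw new RangeError("augment out of range");
      }
      return AUGMENT_OFFSET + action.augment;
    case "resign":
      return RESIGN_INDEX;
  }
}

// Inverse mapping: recover the structured action from a slot index.
function decodeAction(index: number): Action {
  if (!inRange(index, ACTION_SPACE_SIZE)) {
    throw new RangeError("action index out of range");
  }
  if (index < AUGMENT_OFFSET) {
    return {
      kind: "move",
      fromSquare: Math.floor(index / MOVE_PLANES),
      plane: index % MOVE_PLANES,
    };
  }
  if (index < RESIGN_INDEX) {
    return { kind: "augment", augment: index - AUGMENT_OFFSET };
  }
  return { kind: "resign" };
}
```

A fixed, dense layout like this is what lets one policy head cover ordinary moves, augment picks, and resignation with a single softmax.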
apps/chess-ai-service is a Bun API. Concurrency is one MongoDB call:
const job = await jobs.findOneAndUpdate(
  { status: "pending" },
  { $set: { status: "claimed", workerId, claimedAt: new Date() } },
  { sort: { createdAt: 1 }, returnDocument: "after" },
);
Two workers racing the same row both run the same findOneAndUpdate; one wins, the other retries. No Redis, no broker. apps/chess-ai-worker pulls jobs and ships PGN plus per-position tensors back. training/chess-ai/ is a PyTorch transformer on pyenv 3.11.9 — 12 layers, 384 hidden, 8 heads, ~24M params, policy and value heads, AdamW, MongoDB IterableDataset, ONNX export. 123 fixtures verify TS and Python encoders produce byte-identical tensors. apps/chess-ai-dashboard is a 7-page React app for monitoring runs.
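The byte-identical check on the fixtures is worth spelling out, because value equality is weaker than byte equality: `0 === -0` is true in JavaScript even though the two floats have different bit patterns. A minimal sketch, assuming each fixture loads into a `Float32Array` on the TS side; the function name is hypothetical and the real fixture format is not described in this post.

```typescript
// Compare two float32 tensors by their raw bytes rather than by value.
// Value comparison would treat 0 and -0 as equal and hide an encoder mismatch.
function tensorBytesEqual(a: Float32Array, b: Float32Array): boolean {
  if (a.byteLength !== b.byteLength) return false;
  // View the same memory as bytes; byteOffset matters if the array is a slice.
  const aBytes = new Uint8Array(a.buffer, a.byteOffset, a.byteLength);
  const bBytes = new Uint8Array(b.buffer, b.byteOffset, b.byteLength);
  for (let i = 0; i < aBytes.length; i++) {
    if (aBytes[i] !== bBytes[i]) return false;
  }
  return true;
}
```

In a fixture test, `a` would be the TS encoder's output and `b` the tensor the Python encoder wrote for the same position.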
First OOM: a RAG pipeline that leaked for two days
April 4: doc-search-api hit its 256Mi limit after 2d 5h uptime. April 5: game-ws-relay-service-stag did the same, also at 2d 5h. Same shape — per-request tensors, document cache, Mongo pools, and Qdrant client buffers accumulated until steady-state crept to 237Mi (92.6%) and the kernel killed it. PR #359 scaled the Bun service to zero so the Go replacement could take over; for ws-relay-stag I bumped the limit 256Mi → 512Mi.
Liveness and readiness must check different things
The more interesting failure was ws-relay-go-stag. Its MongoDB change-stream client failed around 02:00 UTC and could not reconnect. GET /health returned 503 because csm.isHealthy() was false. That endpoint was wired to both readiness and liveness — so Kubernetes restarted the pod, the new pod failed /health for the same Mongo reason, and the loop repeated. Infinite restarts on a slow upstream.
The split, written into the runbook as PR #362:
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }  # process alive only
readinessProbe:
  httpGet: { path: /health, port: 8080 }   # deps healthy (Mongo, change stream)
Liveness restarts the process. Readiness pulls a pod out of the load balancer. A slow upstream should drain traffic, not kill the pod that serves it.
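On the server side, the split comes down to which endpoint is allowed to look at dependencies. The sketch below is in TypeScript for consistency with the rest of the post, although ws-relay-go-stag itself is a Go service; `DependencyCheck`, `ProbeResponse`, and `handleProbe` are hypothetical names, and only the two paths and the 503-on-unhealthy behavior come from the incident.

```typescript
interface DependencyCheck {
  name: string;
  isHealthy: () => boolean;
}

interface ProbeResponse {
  status: number;
  body: string;
}

function handleProbe(path: string, deps: DependencyCheck[]): ProbeResponse {
  if (path === "/healthz") {
    // Liveness: being able to answer at all proves the process is alive.
    // Dependencies are deliberately not consulted here.
    return { status: 200, body: "ok" };
  }
  if (path === "/health") {
    // Readiness: report 503 while any dependency is unhealthy so the pod
    // is removed from the load balancer without being restarted.
    const failing = deps.filter((d) => !d.isHealthy()).map((d) => d.name);
    return failing.length === 0
      ? { status: 200, body: "ready" }
      : { status: 503, body: `not ready: ${failing.join(", ")}` };
  }
  return { status: 404, body: "not found" };
}
```

With the change stream down, `/health` returns 503 and traffic drains, while `/healthz` keeps returning 200 and the pod stays up to reconnect.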
Second OOM: the trade-off between hiking the limit and fixing the leak
April 22, two weeks later. chess-ai-service was OOMKilled mid-self-play. restartCount was 1283 over 16 days, with usage at 484Mi against a 512Mi limit (94.5%); any inference call would push it over. Pre-OOM logs were clean (Bun startup, MongoDB connect, baseline registration), so this was steady-state pressure, not a spike.
Steady-state alone — Bun runtime, MongoDB client, chess-ai-core, augmented-chess-engine, all 30 routes mounted — was ~480Mi. The 512Mi limit gave under 10% headroom. Every MCTS call allocated a fresh search tree, and per-game cached activations were not freed between games.
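The headroom arithmetic is simple enough to check by hand, but writing it down makes the comparison between limits explicit. This is a small illustrative helper, not part of the service; the function name is an assumption.

```typescript
// Percentage of the memory limit still free at a given usage (both in Mi).
function headroomPercent(usageMi: number, limitMi: number): number {
  if (limitMi <= 0) throw new RangeError("limit must be positive");
  return ((limitMi - usageMi) / limitMi) * 100;
}

// Steady-state of ~480Mi against the old and new limits.
const oldHeadroom = headroomPercent(480, 512); // 6.25
const newHeadroom = headroomPercent(480, 1024); // 53.125
```

At 6.25% headroom, roughly 32Mi, a single fresh search tree is enough to cross the limit.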
I shipped both fixes. Limit hike: requests 256Mi → 512Mi to match steady-state, limits 512Mi → 1024Mi for MCTS bursts. Leak fix: explicit delete on the MCTS root after each game, torch.cuda.empty_cache() between training games, periodic GC on the worker. A higher limit on a leaking process just delays the next OOM — the limit hike buys time, the leak fix earns it.
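The leak fix on the TypeScript side can be sketched as follows, assuming the engine keeps the search tree on an instance field. The class, the `MctsNode` shape, and the placeholder `search` loop are all hypothetical; the real MCTS-Neural engine is not shown in this post. The only point illustrated is dropping the root reference when a game ends so the garbage collector can reclaim the whole tree.

```typescript
interface MctsNode {
  visits: number;
  children: Map<number, MctsNode>;
}

class SelfPlayEngine {
  // Optional so that `delete` is permitted by the type checker.
  private root?: MctsNode;

  // Placeholder for a real search: grows the tree and returns root visits.
  search(simulations: number): number {
    const root: MctsNode = this.root ?? { visits: 0, children: new Map() };
    this.root = root;
    for (let i = 0; i < simulations; i++) {
      const action = i % 8; // stand-in for action selection
      const child: MctsNode =
        root.children.get(action) ?? { visits: 0, children: new Map() };
      child.visits += 1;
      root.children.set(action, child);
      root.visits += 1;
    }
    return root.visits;
  }

  // Call once per finished game. Removing the only reference to the root
  // makes the entire tree unreachable and eligible for garbage collection.
  endGame(): void {
    delete this.root;
  }

  holdsTree(): boolean {
    return this.root !== undefined;
  }
}
```

Note that `delete` only removes the reference; memory is returned when the collector next runs, which is why the fix pairs it with periodic GC on the worker.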
The backstop is the queue model from Stage 2. A worker OOMKilled mid-job leaves its row in claimed with a stale claimedAt; a janitor query reverts stale rows to pending and another worker picks them up. Crashes lose throughput, not data. That is the only reason I could ship to production with a known leak.
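A janitor of this kind can be a single `updateMany`. The sketch below assumes the same `status`, `workerId`, and `claimedAt` fields as the claim query earlier; the ten-minute threshold, the function names, and the narrow `JobCollection` interface (a stand-in for the MongoDB driver's collection type) are assumptions, not the shipped code.

```typescript
// Minimal structural stand-in for the driver's collection type.
interface UpdateManyResult {
  modifiedCount: number;
}
interface JobCollection {
  updateMany(
    filter: Record<string, unknown>,
    update: Record<string, unknown>,
  ): Promise<UpdateManyResult>;
}

// Hypothetical threshold: a claim older than this is treated as abandoned.
const STALE_AFTER_MS = 10 * 60 * 1000;

// Build the filter and update documents for reverting stale claims.
function staleClaimQuery(now: Date, staleAfterMs: number = STALE_AFTER_MS) {
  const cutoff = new Date(now.getTime() - staleAfterMs);
  return {
    filter: { status: "claimed", claimedAt: { $lt: cutoff } },
    update: {
      $set: { status: "pending" },
      $unset: { workerId: "", claimedAt: "" },
    },
  };
}

// Revert stale rows to pending and report how many were requeued.
async function requeueStaleJobs(
  jobs: JobCollection,
  now: Date = new Date(),
): Promise<number> {
  const { filter, update } = staleClaimQuery(now);
  const result = await jobs.updateMany(filter, update);
  return result.modifiedCount;
}
```

The threshold has to exceed the longest legitimate job, otherwise the janitor requeues work that a healthy worker is still processing.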
What this changes going forward
Memory limits cover steady-state plus peak burst, not whatever the test harness measured. Liveness and readiness check different things — process versus dependencies — and conflating them turns a slow upstream into a 2 a.m. restart loop. Self-play workloads are unusually leak-tolerant because the queue makes crashes recoverable, but tolerant is not free. Fix the leak.
AI workflow note
Claude drafted the five-stage breakdown via the planner agent before any code shipped, which let Stage 4 (Python training) and Stage 2 (TS API) run in parallel sub-agents without stepping on each other. The pytorch-build-resolver agent earned its keep on tensor-shape and CUDA errors during the cross-language fixture phase — encoder mismatches surface as silent shape errors twelve layers deep. The discipline that mattered most was running the worker locally for a full hour before deploying; the OOM trend was visible in top at minute 45, and I would rather see it on my laptop than in an Alertmanager page.
