Self-play training for augmented chess hit the home cluster's RAM ceiling — Wintersalmon

Augmented chess forced a self-play pipeline, and the cluster pushed back

Augmented chess re-rolls a small set of rule-changing augments per game: a knight that also moves like a king, a pawn that captures backward once. Stockfish does not know those rules, so the only way to have a real opponent was to train one. Three weeks in April: a five-stage self-play pipeline, a ~24M-parameter network, two OOMs, and the lesson that liveness and readiness must check different things. The engine pre-rolls augments into the event log so finished games replay from seed plus action stream, the property I leaned on in "Three event-sourcing bugs, three pillars of one contract".

TL;DR

5 stages, 5 PRs (#350, #351, #352, #356, #357), 94 source files, 104 tests, ~24M params.
First OOM: doc-search-api RAG pipeline leak — peak 237Mi against a 256Mi limit; fix was PR #359 scaling the Bun service to zero in favor of the Go rewrite.
Second OOM: chess-ai-service idled at 484Mi against a 512Mi limit, OOMKilled 1283 times over 16 days.
ws-relay-go-stag taught the probe lesson: a single /health wired to both probes turned a slow Mongo upstream into an infinite restart loop.
Fix shipped both ways: limits 512Mi → 1024Mi, and explicit delete on MCTS roots between games. A higher limit on a leaking process just delays the next page.

Five stages, one pluggable engine interface

The pipeline landed across PRs #350, #351, #352, #356, #357. @cloudnest/chess-ai-core is pure TypeScript with an AlphaZero-style 4708-slot action space (64×73 piece moves + 35 augment selections + resign) and a plugin engine interface so one MCTS loop swaps between Stockfish (UCI binary via Bun.spawn), a random baseline, and the MCTS-Neural engine wrapping the trained ONNX model.
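The slot count checks out by hand: 64 × 73 = 4672 piece-move slots, plus 35 augment selections, plus one resign slot gives 4708. Below is a minimal sketch of one possible index layout. The ordering (piece moves, then augments, then resign) and the names encodeAction and decodeAction are assumptions for illustration, not the actual @cloudnest/chess-ai-core API.

```typescript
// Assumed layout: [0, 4672) piece moves, [4672, 4707) augment picks, 4707 resign.
const SQUARES = 64;
const MOVE_PLANES = 73; // AlphaZero-style move planes per origin square
const AUGMENT_SLOTS = 35;
const AUGMENT_OFFSET = SQUARES * MOVE_PLANES; // 4672
const RESIGN_INDEX = AUGMENT_OFFSET + AUGMENT_SLOTS; // 4707
const ACTION_SPACE_SIZE = RESIGN_INDEX + 1; // 4708

type Action =
  | { kind: "move"; fromSquare: number; plane: number }
  | { kind: "augment"; slot: number }
  | { kind: "resign" };

function assertInRange(value: number, limit: number, name: string): void {
  if (!Number.isInteger(value) || value < 0 || value >= limit) {
    throw new RangeError(`${name} out of range: ${value}`);
  }
}

// Map a structured action to its flat policy index.
function encodeAction(action: Action): number {
  switch (action.kind) {
    case "move":
      assertInRange(action.fromSquare, SQUARES, "fromSquare");
      assertInRange(action.plane, MOVE_PLANES, "plane");
      return action.fromSquare * MOVE_PLANES + action.plane;
    case "augment":
      assertInRange(action.slot, AUGMENT_SLOTS, "slot");
      return AUGMENT_OFFSET + action.slot;
    case "resign":
      return RESIGN_INDEX;
  }
}

// Inverse of encodeAction for any index in [0, ACTION_SPACE_SIZE).
function decodeAction(index: number): Action {
  assertInRange(index, ACTION_SPACE_SIZE, "index");
  if (index < AUGMENT_OFFSET) {
    return {
      kind: "move",
      fromSquare: Math.floor(index / MOVE_PLANES),
      plane: index % MOVE_PLANES,
    };
  }
  if (index < RESIGN_INDEX) {
    return { kind: "augment", slot: index - AUGMENT_OFFSET };
  }
  return { kind: "resign" };
}
```

Keeping the augment and resign slots after the piece moves means the first 4672 indices can stay compatible with a standard 64×73 move encoding, which is one plausible reason to order them this way.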

apps/chess-ai-service is a Bun API. Concurrency is one MongoDB call:

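// Atomically claim the oldest pending job: the filter match and the
// status flip happen in one server-side operation.
const job = await jobs.findOneAndUpdate(
  { status: "pending" },
  { $set: { status: "claimed", workerId, claimedAt: new Date() } },
  // Oldest first; return the document as it looks after the update.
  { sort: { createdAt: 1 }, returnDocument: "after" },
);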

Two workers racing the same row both run the same findOneAndUpdate; one wins, the other retries. No Redis, no broker.

apps/chess-ai-worker pulls jobs and ships PGN plus per-position tensors back. training/chess-ai/ is a PyTorch transformer on pyenv 3.11.9: 12 layers, 384 hidden, 8 heads, ~24M params, policy and value heads, AdamW, MongoDB IterableDataset, ONNX export. 123 fixtures verify TS and Python encoders produce byte-identical tensors. apps/chess-ai-dashboard is a 7-page React app for monitoring runs.
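Byte-identical is a stricter bar than numerically equal. Here is a hedged sketch of what such a fixture comparison could look like on the TypeScript side. The function name tensorsByteIdentical is hypothetical, and the real fixture harness is not shown in this post.

```typescript
// Compare two float32 tensors by their raw bytes rather than by value.
// Value comparison would treat 0 and -0 as equal and NaN as unequal to
// itself; a byte comparison catches exactly the bit patterns that differ.
function tensorsByteIdentical(a: Float32Array, b: Float32Array): boolean {
  if (a.byteLength !== b.byteLength) return false;
  // View the same memory as bytes, respecting any offset into the buffer.
  const bytesA = new Uint8Array(a.buffer, a.byteOffset, a.byteLength);
  const bytesB = new Uint8Array(b.buffer, b.byteOffset, b.byteLength);
  for (let i = 0; i < bytesA.length; i++) {
    if (bytesA[i] !== bytesB[i]) return false;
  }
  return true;
}
```

For example, tensorsByteIdentical(Float32Array.of(0), Float32Array.of(-0)) is false even though 0 === -0 is true, which is the kind of encoder drift a value-level check would miss.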

First OOM: a RAG pipeline that leaked for two days

April 4: doc-search-api hit its 256Mi limit after 2d 5h uptime. April 5: game-ws-relay-service-stag did the same, also at 2d 5h. Same shape — per-request tensors, document cache, Mongo pools, and Qdrant client buffers accumulated until steady-state crept to 237Mi (92.6%) and the kernel killed it. PR #359 scaled the Bun service to zero so the Go replacement could take over; for ws-relay-stag I bumped the limit 256Mi → 512Mi.

Liveness and readiness must check different things

The more interesting failure was ws-relay-go-stag. Its MongoDB change-stream client failed around 02:00 UTC and could not reconnect. GET /health returned 503 because csm.isHealthy() was false. That endpoint was wired to both readiness and liveness — so Kubernetes restarted the pod, the new pod failed /health for the same Mongo reason, and the loop repeated. Infinite restarts on a slow upstream.

The split, written into the runbook as PR #362:

livenessProbe:
  httpGet: { path: /healthz, port: 8080 }   # process alive only
readinessProbe:
  httpGet: { path: /health, port: 8080 }    # deps healthy (Mongo, change stream)

A failed liveness probe restarts the container. A failed readiness probe only removes the pod from the Service endpoints, so it stops receiving traffic. A slow upstream should drain traffic, not kill the pod that serves it.
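On the handler side the split can be sketched as two functions with different inputs. The real ws-relay-go-stag service is written in Go and its handlers are not shown here; the TypeScript below is an illustration only, and the names livenessHandler, readinessHandler, route, and the shape of the dependency checks are all hypothetical.

```typescript
interface ProbeResult {
  status: number;
  body: string;
}

// Hypothetical dependency check; stands in for something like csm.isHealthy().
type DependencyCheck = () => boolean;

// Liveness: answering at all proves the process is running.
// It deliberately touches no upstream dependency.
function livenessHandler(): ProbeResult {
  return { status: 200, body: "alive" };
}

// Readiness: 503 while any dependency is unhealthy, so traffic drains
// without the container being restarted.
function readinessHandler(checks: Record<string, DependencyCheck>): ProbeResult {
  const failing = Object.entries(checks)
    .filter(([, isHealthy]) => !isHealthy())
    .map(([name]) => name);
  return failing.length === 0
    ? { status: 200, body: "ready" }
    : { status: 503, body: `not ready: ${failing.join(", ")}` };
}

// Minimal path dispatch matching the probe paths in the YAML above.
function route(path: string, checks: Record<string, DependencyCheck>): ProbeResult {
  if (path === "/healthz") return livenessHandler();
  if (path === "/health") return readinessHandler(checks);
  return { status: 404, body: "not found" };
}
```

With a dead change stream, route("/health", ...) reports 503 while route("/healthz", ...) still reports 200, so the pod leaves rotation but is not restarted.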

Second OOM: the trade-off between hiking the limit and fixing the leak

April 22, about two and a half weeks later. chess-ai-service was OOMKilled mid-self-play. restartCount was 1283 over 16 days, with usage at 484Mi against a 512Mi limit (94.5%), so any inference call would push it over. Pre-OOM logs were clean (Bun startup, MongoDB connect, baseline registration), so this was steady-state pressure, not a spike.

Steady-state alone — Bun runtime, MongoDB client, chess-ai-core, augmented-chess-engine, all 30 routes mounted — was ~480Mi. The 512Mi limit gave under 10% headroom. Every MCTS call allocated a fresh search tree, and per-game cached activations were not freed between games.
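The headroom figures are plain ratios of usage to limit. A small sketch of the arithmetic follows; memoryHeadroom is a hypothetical helper written for this post, not part of any service.

```typescript
interface Headroom {
  utilizationPct: number; // usage as a percentage of the limit, 1 decimal
  headroomMi: number; // memory left before the limit, in Mi
  headroomPct: number; // headroom as a percentage of the limit, 1 decimal
}

function roundTo1(value: number): number {
  return Math.round(value * 10) / 10;
}

// Both arguments are in Mi; the limit must be positive.
function memoryHeadroom(usageMi: number, limitMi: number): Headroom {
  if (limitMi <= 0) throw new RangeError("limitMi must be positive");
  const utilizationPct = (usageMi / limitMi) * 100;
  return {
    utilizationPct: roundTo1(utilizationPct),
    headroomMi: limitMi - usageMi,
    headroomPct: roundTo1(100 - utilizationPct),
  };
}

// chess-ai-service before the limit change:
// memoryHeadroom(484, 512) → { utilizationPct: 94.5, headroomMi: 28, headroomPct: 5.5 }
```

At 28Mi of slack, a single freshly allocated search tree is enough to cross the limit.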

I shipped both fixes. Limit hike: requests 256Mi → 512Mi to match steady-state, limits 512Mi → 1024Mi for MCTS bursts. Leak fix: explicit delete on the MCTS root after each game, torch.cuda.empty_cache() between training games, periodic GC on the worker. A higher limit on a leaking process just delays the next OOM — the limit hike buys time, the leak fix earns it.
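The leak fix amounts to dropping the last reference to the search tree once a game ends. The actual chess-ai-core code is not shown in this post, so the sketch below is only an illustration of the pattern; MctsNode, SearchState, search, and playGames are hypothetical names, and the "search" is a toy that just grows the tree.

```typescript
interface MctsNode {
  visits: number;
  children: Map<number, MctsNode>;
}

interface SearchState {
  root?: MctsNode; // optional so that `delete` is allowed under strict TypeScript
}

function newNode(): MctsNode {
  return { visits: 0, children: new Map() };
}

// Toy stand-in for a search: adds `expansions` children to the current root
// and returns how many children the root holds afterwards.
function search(state: SearchState, expansions: number): number {
  const root = state.root ?? (state.root = newNode());
  for (let i = 0; i < expansions; i++) {
    root.children.set(root.children.size, newNode());
    root.visits++;
  }
  return root.children.size;
}

function playGames(games: number, expansionsPerGame: number): number[] {
  const state: SearchState = {};
  const treeSizes: number[] = [];
  for (let g = 0; g < games; g++) {
    treeSizes.push(search(state, expansionsPerGame));
    // Explicitly drop the only reference to the tree so the garbage
    // collector can reclaim it before the next game starts.
    delete state.root;
  }
  return treeSizes;
}
```

Without the delete, the same root would carry over and the tree would grow with every game; with it, each game starts from an empty tree.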

The backstop is the queue model from Stage 2. A worker OOMKilled mid-job leaves its row in claimed with a stale claimedAt; a janitor query reverts stale rows to pending and another worker picks them up. Crashes lose throughput, not data. That is the only reason I could ship to production with a known leak.
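The janitor can be sketched as a filter and update pair in the shape MongoDB's updateMany accepts. The status, workerId, and claimedAt fields come from the claim query earlier in the post; the staleness threshold, the function names, and the in-memory stand-in used to exercise the logic are assumptions for illustration, not the production code.

```typescript
type JobStatus = "pending" | "claimed" | "done";

interface Job {
  id: string;
  status: JobStatus;
  workerId?: string;
  claimedAt?: Date;
}

// Build the filter/update pair. staleAfterMs is a hypothetical threshold:
// how long a claim may sit before the worker is presumed dead.
function staleClaimQuery(now: Date, staleAfterMs: number) {
  const cutoff = new Date(now.getTime() - staleAfterMs);
  return {
    filter: { status: "claimed" as const, claimedAt: { $lt: cutoff } },
    update: {
      $set: { status: "pending" as const },
      $unset: { workerId: "", claimedAt: "" },
    },
  };
}

// In-memory stand-in that applies the same rule to an array of jobs,
// so the logic can be checked without a database. Returns the revert count.
function revertStaleClaims(jobs: Job[], now: Date, staleAfterMs: number): number {
  const { filter } = staleClaimQuery(now, staleAfterMs);
  const cutoffMs = filter.claimedAt.$lt.getTime();
  let reverted = 0;
  for (const job of jobs) {
    const isStale =
      job.status === filter.status &&
      job.claimedAt !== undefined &&
      job.claimedAt.getTime() < cutoffMs;
    if (isStale) {
      job.status = "pending";
      delete job.workerId;
      delete job.claimedAt;
      reverted++;
    }
  }
  return reverted;
}
```

Against a real collection the equivalent call would be along the lines of jobs.updateMany(filter, update). The threshold should comfortably exceed the longest legitimate job, or the janitor will hand live work to a second worker.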

What this changes going forward

Memory limits cover steady-state plus peak burst, not whatever the test harness measured. Liveness and readiness check different things — process versus dependencies — and conflating them turns a slow upstream into a 2 a.m. restart loop. Self-play workloads are unusually leak-tolerant because the queue makes crashes recoverable, but tolerant is not free. Fix the leak.

#ai #kubernetes

AI workflow note

Claude drafted the five-stage breakdown via the planner agent before any code shipped, which let Stage 4 (Python training) and Stage 2 (TS API) run in parallel sub-agents without stepping on each other. The pytorch-build-resolver agent earned its keep on tensor-shape and CUDA errors during the cross-language fixture phase — encoder mismatches surface as silent shape errors twelve layers deep. The discipline that mattered most was running the worker locally for a full hour before deploying; the OOM trend was visible in top at minute 45, and I would rather see it on my laptop than in an Alertmanager page.