Building strong AIs for Gabo
An 18-bot AI ladder, two tournament formats, two rounds of autotuning, MCTS, and the methods and lessons behind chasing the strongest practical bot.
Gabo is a 1 to 6 player memory-and-matching card game. We built 19 distinct bots ranging from random to Cyclone, the current champion. Cyclone stacks three independent gains: stateful opponent tracking (per-slot value adjustments inferred from how the opponent's hand changes), autotuned thresholds re-fitted to the current ruleset, and an endgame give-to-win finisher that empties the hand for an instant win.
Cyclone beats the previous champion Apex 71% and Vortex (Apex + finisher) 70%. The headline lesson: when the game rules changed (seeded discard, permanent holes, empty-hand win), the old autotuned parameters went stale, and re-tuning recovered ~12 percentage points. The opponent tracker remained the single most valuable component throughout, worth more than any amount of parameter tuning alone.
The roster
Each bot is one strategy. The Belief family is shown in development order. Personality bots (Beginner, MatchHunter, etc.) sit lower on the ladder and exist to fill out the Elo curve.
| Bot | Idea |
|---|---|
| Belief | Bayesian rank-residual baseline |
| Belief+ | Tighter swap and Gabo thresholds, aggressive abilities |
| Duel | First bot tuned for 1-vs-1 play |
| Echo | Duel + stateful opponent action tracking |
| Forge | Autotune round 1 (random search, 24 candidates) |
| Hammer | Autotune round 2 (hill climb, 30 candidates) |
| Apex | Hammer params + opponent tracker. |
| Tempest | Re-tuned params for the current ruleset (no tracker). |
| Vortex | Apex + empty-hand give-to-win finisher. |
| Cyclone | Tracker + re-tuned params + give-to-win. Current champion. |
| Searcher | Determinized search using Belief playouts. About 100x slower per move. |
The empirical ladder
18-bot 1v1 round-robin. 800 games per pair, ~122k games total. Elo fit via Bradley-Terry maximum-likelihood with Laplace smoothing; Random anchored to 1000.
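A minimal sketch of how such a ladder can be fitted, assuming a square matrix of head-to-head win counts. The function name `fitElo`, the constant pseudo-count of 0.5, and the fixed iteration count are illustrative assumptions, not the project's code. It uses the standard minorize-maximize update for Bradley-Terry strengths and then shifts the ratings so the anchor bot sits at 1000.

```typescript
// wins[i][j] = number of games bot i won against bot j (diagonal ignored).
type WinMatrix = number[][];

// Fit Bradley-Terry strengths with the MM (minorize-maximize) update, then map
// them to an Elo-style scale with one bot pinned to a fixed rating.
// `pseudo` is a Laplace pseudo-count added to every off-diagonal cell
// (assumed constant here; the real smoothing may scale with game count).
function fitElo(
  wins: WinMatrix,
  anchorIndex: number,
  anchorElo = 1000,
  pseudo = 0.5,
  iterations = 500,
): number[] {
  const n = wins.length;
  // Smoothed counts: no pairing is ever exactly 100/0, so strengths stay finite.
  const w = wins.map((row, i) => row.map((x, j) => (i === j ? 0 : x + pseudo)));
  let strength: number[] = new Array(n).fill(1);

  for (let it = 0; it < iterations; it++) {
    const next = strength.map((_, i) => {
      let totalWins = 0;
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        totalWins += w[i][j];
        denom += (w[i][j] + w[j][i]) / (strength[i] + strength[j]);
      }
      return totalWins / denom;
    });
    // Strengths are defined only up to a scale factor; renormalize by the geometric mean.
    const logMean = next.reduce((s, p) => s + Math.log(p), 0) / n;
    strength = next.map((p) => p / Math.exp(logMean));
  }

  // Elo convention: a 400-point gap corresponds to 10:1 odds.
  const elo = strength.map((p) => 400 * Math.log10(p));
  const shift = anchorElo - elo[anchorIndex];
  return elo.map((e) => e + shift);
}
```

Without the pseudo-count, a bot that never loses a single game would get an infinite strength, which is why the smoothing step matters for a ladder with many 100/0 pairings.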
Win-rate heatmap
Each cell is the percentage of head-to-head games the row bot won against the column bot. Bottom-tier bots (Random through MatchHunter) are omitted from the matrix since they all lose 0-100 to the top tier.
| | Apex | Edge | Spire | Hammer | Forge | Echo | Duel | Belief+ | Belief | Heuristic | PeekHoarder | Beginner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apex | · | 57 | 70 | 82 | 87 | 97 | 95 | 99 | 100 | 100 | 100 | 100 |
| Edge | 43 | · | 65 | 77 | 83 | 95 | 94 | 99 | 100 | 100 | 100 | 100 |
| Spire | 30 | 35 | · | 88 | 93 | 92 | 98 | 100 | 100 | 100 | 100 | 100 |
| Hammer | 18 | 23 | 12 | · | 58 | 79 | 81 | 98 | 100 | 100 | 100 | 100 |
| Forge | 13 | 17 | 7 | 42 | · | 64 | 76 | 97 | 99 | 100 | 100 | 100 |
| Echo | 3 | 5 | 8 | 21 | 36 | · | 49 | 86 | 98 | 99 | 100 | 100 |
| Duel | 5 | 6 | 2 | 19 | 24 | 51 | · | 92 | 99 | 100 | 100 | 100 |
| Belief+ | 1 | 1 | 0 | 2 | 3 | 14 | 8 | · | 95 | 100 | 99 | 100 |
| Belief | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 5 | · | 67 | 96 | 100 |
| Heuristic | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 33 | · | 96 | 100 |
| PeekHoarder | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 4 | · | 61 |
| Beginner | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 39 | · |
Head-to-head highlights
Row-on-left wins this percentage. The gap between the top three and the rest is brutal, 100/0 in most matchups.
What the autotuner discovered
The same Belief shell, with different parameter values fitted by random search and hill climbing. The direction of travel matters: swap-delta keeps climbing (less swapping, not more), Gabo cutoffs tighten (call when more confident), and give-card threshold drops (give almost anything after a match).
| Parameter | Duel | Forge | Hammer | Edge |
|---|---|---|---|---|
| swap delta | 0.6 | 1.20 | 1.57 | 2.41 |
| Gabo P critical | 0.18 | 0.21 | 0.13 | 0.06 |
| Gabo P risky | 0.08 | 0.10 | 0.13 | 0.05 |
| match speculation | 0.22 | 0.22 | 0.27 | 0.19 |
| swap hands delta | 4.0 | 4.9 | 3.5 | 2.6 |
| give-card threshold | 5 | 3.5 | 2 | 1.7 |
Searcher: the search-based bot
Determinized search. For each candidate top-of-turn action, sample 50 hidden-state determinizations from belief, apply the action, play out both sides using Belief under perfect information, pick the action with the best mean round-score. About 100 times slower per move than the heuristic bots.
Searcher lands between Echo and Forge in strength. Loses 30 to Forge and worse to Hammer / Apex. The autotuned heuristic beats explicit search at a fraction of the runtime cost.
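The search loop described above can be sketched generically. This is a hedged illustration, not Searcher's source: the `SearchHooks` interface and both of its methods are hypothetical names standing in for the engine and the Belief playout policy. One deliberate difference from the description: the sketch samples the hidden states once and reuses them for every candidate, so actions are compared on identical worlds.

```typescript
// Hypothetical engine hooks; none of these names come from the project's code.
interface SearchHooks<State, Action> {
  // Draw one fully specified hidden state consistent with the bot's belief.
  sampleDeterminization(): State;
  // Apply the action to a copy of the state, play the round out with a fixed
  // policy, and return a score where HIGHER is better for the searching bot.
  // (If the engine's round score is lower-is-better, negate it here.)
  playout(state: State, action: Action): number;
}

function chooseByDeterminizedSearch<State, Action>(
  candidates: Action[],
  hooks: SearchHooks<State, Action>,
  samples = 50,
): Action {
  if (candidates.length === 0) throw new Error("no candidate actions");

  // Sample the worlds once and reuse them for every action (common random numbers).
  const worlds: State[] = [];
  for (let i = 0; i < samples; i++) worlds.push(hooks.sampleDeterminization());

  let best = candidates[0];
  let bestMean = -Infinity;
  for (const action of candidates) {
    let total = 0;
    for (const world of worlds) total += hooks.playout(world, action);
    const mean = total / worlds.length;
    if (mean > bestMean) {
      bestMean = mean;
      best = action;
    }
  }
  return best;
}
```

The cost is one full playout per (action, sample) pair, which is consistent with the roughly 100x per-move slowdown relative to a single heuristic evaluation.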
4-player tournament
3,000 games with 4 bots randomly chosen from the 16-bot roster each game. The 1v1-tuned champion also dominates 4-player. Apex wins 66.7% of games it plays in and has an average finishing rank of 1.42 out of 4.
| Bot | Wins | Win % | Avg rank |
|---|---|---|---|
| Apex | 498 | 66.7% | 1.42 |
| Forge | 431 | 59.1% | 1.57 |
| Hammer | 425 | 58.9% | 1.56 |
| Echo | 397 | 51.0% | 1.64 |
| Duel | 396 | 52.0% | 1.65 |
| Belief+ | 289 | 38.2% | 1.91 |
| Belief | 188 | 25.8% | 2.16 |
| Heuristic | 143 | 18.1% | 2.33 |
| PeekHoarder | 71 | 9.5% | 2.60 |
| Beginner | 64 | 8.1% | 2.66 |
| NeverGabo | 41 | 5.3% | 2.90 |
| GreedyGabo | 27 | 3.7% | 3.03 |
| DiscardOnly | 15 | 2.0% | 3.38 |
| AlwaysDraw | 14 | 2.0% | 3.45 |
| Random | 0 | 0.0% | 3.84 |
| MatchHunter | 1 | 0.1% | 3.92 |
Development timeline
1. Pure engine first: Deterministic state machine. 26-action discriminated union, 22-event union, seedable RNG, per-player view filtering. No UI, no AI. A type-checked smoke test ran a full round before any bot existed.
2. Random + Heuristic baselines: Random validates that the engine doesn't deadlock. Heuristic uses a constant-7 estimate for any unseen card to drive swap/Gabo decisions.
3. Belief tracker: Bayesian rank-residual. For an unknown card, P(rank=r) = unknownCount[r] / totalUnknown. Strictly better than the constant estimate.
4. Belief+ general tuning: Tighter swap threshold, aggressive ability use, and a 'speculative Gabo' branch that calls when the expected total clearly beats the opponents'.
5. Duel (1v1 specialization): Two-player tuning. Bayesian Gabo using a normal approximation of P(opp_sum < my_sum). Always activates info abilities. Tightest swap and match thresholds.
6. Echo (opponent action tracking): Stateful bot. Diffs the opponent's hand turn over turn. When a slot's card-id changes and the new card isn't in our seen set, the opponent kept a deliberately chosen card. Mark the slot with a lower expected value.
7. Personality bots: Six single-rule bots (AlwaysDraw, DiscardOnly, MatchHunter, GreedyGabo, NeverGabo, PeekHoarder) populate the mid-low ladder so the Elo curve is smooth, not a top-tier oasis.
8. Knowledge tracker fix: The latent bug that pinned the ceiling. Public-card exposures (everything that touches the discard pile, plus successful matches and given cards) now propagate to every player's seenCardIds, so what the bots can remember matches what the rules expose. Top-tier separation widened by more than 20 points.
9. Forge (autotune round 1): Random search over Belief's parameter box. 24 candidates, 200 games each. The winner improved gauntlet winrate by 11.5pp. Big insight: aggressive swapping was a net negative because the displaced card lands publicly on the discard pile, leaking information faster than the swap improves the hand.
10. Searcher (MCTS): For each top-of-turn action, sample 50 determinizations of the opponent's hand from belief, apply the candidate, then play out using Belief perfect-info rollouts. Beats Duel about 57% of the time but at 100x the per-move cost. Lands roughly tied with Forge.
11. Hammer (autotune round 2): Hill-climbed around Forge for 30 candidates × 350 games. Found 93% gauntlet winrate (+6.2pp). Discovered direction: even more conservative swapping (1.57 vs 1.2) and more aggressive Gabo when confident (P-critical 0.13 vs 0.21).
12. Apex (autotune + tracker): Hammer's params + Echo's per-slot adjustments. The two ideas stack into the top bot on this ladder. 95% vs Duel, 82% vs Hammer, 100% vs everything below the top tier.
13. Edge + Spire (round 3): Autotune round 3 against a tougher gauntlet (including Hammer itself). Edge has very conservative swap (2.41) and very aggressive Gabo. Spire adds Echo on top. Both end up slightly weaker than Apex head-to-head; the round-3 params conflict with the tracker's nudges.
Methods
Autotuning the strongest bot
We have a parameterized Belief variant where every threshold (swap delta, speculative-match probability, Gabo confidence cutoffs, give-card threshold) is a constructor argument. A random-search script samples points in a 6-dimensional box, evaluates each candidate against a 5-bot gauntlet (200 to 350 games per candidate), and reports the best. Forge = round 1, Hammer = round 2, Apex = Hammer + Echo.
```
best = { params: none, rate: -infinity }
for i in 1..numCandidates:
    candidate = sampleRandom()
    winRate = playGauntlet(candidate, gauntlet, 1000 games)
    if winRate > best.rate:
        best = { params: candidate, rate: winRate }
```
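As a runnable illustration of the same loop, here is a hedged TypeScript sketch. The names `Box`, `sampleInBox`, and `randomSearch` are hypothetical, and `evaluate` stands in for a gauntlet run that returns a win rate. A small seedable generator (mulberry32) is included so a tuning run can be repeated exactly.

```typescript
type Params = Record<string, number>;
type Box = Record<string, [number, number]>; // [min, max) per parameter

// mulberry32: a small seedable PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw one point uniformly from the parameter box.
function sampleInBox(box: Box, rand: () => number): Params {
  const p: Params = {};
  for (const name of Object.keys(box)) {
    const [lo, hi] = box[name];
    p[name] = lo + rand() * (hi - lo);
  }
  return p;
}

// Evaluate `candidates` random points and keep the one with the highest score.
function randomSearch(
  box: Box,
  evaluate: (p: Params) => number, // e.g. gauntlet win rate in [0, 1]
  candidates: number,
  seed = 1,
): { params: Params; rate: number } {
  const rand = mulberry32(seed);
  let best: { params: Params; rate: number } | null = null;
  for (let i = 0; i < candidates; i++) {
    const params = sampleInBox(box, rand);
    const rate = evaluate(params);
    if (best === null || rate > best.rate) best = { params, rate };
  }
  if (best === null) throw new Error("candidates must be at least 1");
  return best;
}
```

Because each gauntlet evaluation is noisy, a candidate that wins by a small margin may not be truly better; re-scoring the top few candidates with more games is one way to guard against that.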
Per-player knowledge
Each seat carries a Set of card IDs whose identity that player has observed. Events update it: own peeks, ability reveals visible to that player, public discards, public match results. AIs that read from this Set play with truthful information.
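A minimal sketch of how such a per-seat set could be maintained from events. The event shapes below are hypothetical and much smaller than the engine's real 22-event union; only the private/public split is meant to be illustrative.

```typescript
type CardId = number;
type Seat = number;

// Hypothetical event shapes (assumption: the real union has different names).
type KnowledgeEvent =
  | { kind: "peek"; viewer: Seat; cardId: CardId } // private: one seat sees the card
  | { kind: "reveal"; viewers: Seat[]; cardId: CardId } // ability reveal to some seats
  | { kind: "discard"; cardId: CardId } // public
  | { kind: "match"; cardId: CardId }; // public

// Update every seat's seen-set in place for one event.
function applyKnowledge(seen: Map<Seat, Set<CardId>>, ev: KnowledgeEvent): void {
  switch (ev.kind) {
    case "peek":
      seen.get(ev.viewer)?.add(ev.cardId);
      break;
    case "reveal":
      for (const seat of ev.viewers) seen.get(seat)?.add(ev.cardId);
      break;
    case "discard":
    case "match": {
      // Public exposure: every seat learns the card's identity.
      const id = ev.cardId;
      seen.forEach((set) => set.add(id));
      break;
    }
  }
}
```

Keeping the update in one reducer makes it harder for a public event to be recorded for only the acting player, which is the kind of omission that silently weakens every bot reading from the set.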
Belief: Bayesian rank-residual
Enumerate the 54-card deck. Subtract every card on the discard pile (public) plus every card the viewer has personally seen. The remainder is the unknown pool. For any unknown card with no other info, P(rank=r) = unknownCount[r] / totalUnknown. Expected value follows by linearity.
```
for each unknown card:
    P(rank = r) = unknownCount[r] / totalUnknown
    E[value]    = sum_r P(r) * value(r)
```
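The same computation as a small TypeScript sketch. The `Card` shape and the `rankBelief` name are assumptions for illustration, and the scoring rule is passed in as `valueOf` rather than hard-coded, since the card values are not listed here.

```typescript
interface Card {
  id: number;
  rank: string;
}

// Rank distribution and expected value for one card about which nothing is known.
function rankBelief(
  deck: Card[],
  seenIds: Set<number>, // discard pile plus everything this viewer has seen
  valueOf: (rank: string) => number, // scoring rule supplied by the caller
): { probs: Map<string, number>; expected: number } {
  // Count the cards of each rank that are still unaccounted for.
  const unknownCount = new Map<string, number>();
  let totalUnknown = 0;
  for (const card of deck) {
    if (seenIds.has(card.id)) continue;
    unknownCount.set(card.rank, (unknownCount.get(card.rank) ?? 0) + 1);
    totalUnknown++;
  }

  // P(rank = r) = unknownCount[r] / totalUnknown; expectation by linearity.
  const probs = new Map<string, number>();
  let expected = 0;
  unknownCount.forEach((count, rank) => {
    const p = count / totalUnknown;
    probs.set(rank, p);
    expected += p * valueOf(rank);
  });
  return { probs, expected };
}
```

For example, with a toy four-card deck of two aces (value 1) and two kings (value 13, both values chosen only for this example), seeing one ace leaves one ace and two kings unknown, so the expected value of an unknown card is (1 + 13 + 13) / 3 = 9.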
Bayesian Gabo timing
Replace heuristic Gabo thresholds with explicit probability estimation. Estimate (mean, variance) of my hand sum and opp's hand sum from known plus belief. Compute P(opp lower) using a normal approximation. Call Gabo when this probability is comfortably small.
```
diff = oppSum - mySum ~ Normal(mu, sigma_squared)
P(opp lower) = P(diff < 0) = Phi(-mu / sigma)
P(I'm lower) = 1 - P(opp lower) = Phi(mu / sigma)
Call Gabo when P(I'm lower) > 0.87, i.e. P(opp lower) < 0.13 (Apex threshold)
```
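A hedged sketch of the timing rule in TypeScript. With diff = oppSum - mySum modelled as Normal(mu, sigma squared), the probability that my hand is lower is P(diff > 0) = Phi(mu / sigma). The sketch assumes the two hand sums are independent, so their variances add; the names `SumEstimate`, `probIAmLower`, and `shouldCallGabo` are illustrative. JavaScript has no built-in error function, so a standard polynomial approximation is included.

```typescript
// Abramowitz-Stegun formula 7.1.26 for erf; absolute error below 1.5e-7.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal CDF, Phi(z).
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

interface SumEstimate {
  mean: number;
  variance: number;
}

// P(my hand sum < opponent's hand sum) under the normal approximation.
// Assumption: the two sums are independent, so variances add.
function probIAmLower(mine: SumEstimate, opp: SumEstimate): number {
  const mu = opp.mean - mine.mean;
  const sigma = Math.sqrt(opp.variance + mine.variance);
  if (sigma === 0) return mu > 0 ? 1 : mu < 0 ? 0 : 0.5; // no uncertainty left
  return normalCdf(mu / sigma);
}

// Call Gabo when the chance that the opponent is lower falls below the cutoff.
function shouldCallGabo(mine: SumEstimate, opp: SumEstimate, pCritical = 0.13): boolean {
  return 1 - probIAmLower(mine, opp) < pCritical;
}
```

The default cutoff of 0.13 mirrors the Hammer value of "Gabo P critical" in the parameter table; any other cutoff from that table can be passed as the third argument.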
Match speculation
Known matches always claim (reaction 0.5 to 0.8s). When P(rank = discard top) exceeds threshold, speculate on an unknown slot (about 1.0 to 1.4s). Hammer uses 0.22, Belief+ uses 0.27, plain Belief uses 0.32. Lower threshold = more attempted matches = more penalty cards on misses but more board control on hits.
Opponent action tracking (Echo, Apex, Spire)
Stateful: snapshot opp's slot card-ids each turn. If a slot's id changes and the new card isn't in our seenCardIds, the opponent drew from the deck and chose to keep it. A rational opponent only keeps cards better than their worst slot, so the new card has below-average expected value. Mark the slot with an override (2.5 = the conditional mean of cards a strong opp would keep).
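The snapshot-and-diff step can be sketched as a small class. This is an illustration under stated assumptions, not the Echo source: the class name, the `observe` signature, and the rule for clearing an override are all hypothetical. Only the 2.5 override comes from the text above.

```typescript
const KEPT_CARD_VALUE = 2.5; // override value quoted in the text above

class OpponentTracker {
  // Opponent's slot card-ids from the previous observation (null = empty slot).
  private lastSlots: (number | null)[] = [];
  // Slot index -> overridden expected value for that slot.
  readonly overrides = new Map<number, number>();

  // Call once per turn with the opponent's current slot card-ids.
  observe(slots: (number | null)[], seenCardIds: Set<number>): void {
    slots.forEach((id, slot) => {
      const before = this.lastSlots[slot];
      // First snapshot of this slot, or nothing changed: no new inference.
      if (before === undefined || id === before) return;
      if (id === null || seenCardIds.has(id)) {
        // Slot emptied, or we know the new card exactly: drop any old guess.
        this.overrides.delete(slot);
      } else {
        // Unseen replacement: assume it was drawn from the deck and kept on purpose.
        this.overrides.set(slot, KEPT_CARD_VALUE);
      }
    });
    this.lastSlots = slots.slice();
  }
}
```

A bot would consult `overrides` before falling back to the rank-residual expected value for a slot, so the tracker only changes estimates for slots where the opponent has revealed a preference.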
What didn't work
The interesting failures. Each one taught us something about the surface.
First MCTS attempt
Heuristic playouts with only 18 rollouts and a broken round-end scoring path (callGabo's value was computed from raw hand sum, missing the +15 miscall penalty / -10 winner bonus). Lost to every bot including Heuristic. Fixed: switched to Belief playouts at 50 rollouts, corrected the round-history baseline. Now Searcher is competitive but slow.
OppTracker constant override (overcooked at 4.0)
Marking deliberately-kept slots with an expected value of 4.0 overestimated the value of kept cards. Lowered to 2.5 (the conditional mean of cards a rational opp would keep) and Echo now ties Duel.
Spire (autotune r3 + tracker)
Adding the opponent tracker on top of Edge's aggressive parameters didn't compound. Edge's very low Gabo cutoff conflicts with the tracker's nudges (which downgrade opp's apparent total). Apex (Hammer + tracker) remains the strongest.
Lessons
1. Get the boring infrastructure right first. The knowledge tracker bug was invisible until tournaments at scale exposed it.
2. Autotuning beats hand-tuning even when humans wrote the heuristic. Random search over 6 parameters found Forge that beats hand-tuned Duel by 74-26, and the winning direction (less aggressive swapping) was counterintuitive.
3. Two ideas often stack. Hammer + Echo gives Apex. The combined approach beats either alone by a wider margin.
4. Specialization generalizes. The 1v1-tuned bot also won the 4-player tournament. The underlying ideas (Bayesian Gabo, info abilities, tight thresholds) are format-agnostic.
5. Search is not always worth it. After fixing MCTS, it matches Forge, but at 100x the per-move runtime. The autotuned heuristic is the better practical choice.
6. Pure functions make research cheap. A 30,000-game tournament runs in about 130s in TypeScript with no optimization.
7. Elo math breaks on 100/0 matchups. Laplace smoothing scaled to game count keeps the ladder finite.
8. Tie behavior matters in real-time mechanics. Match windows had implicit seat-order bias until 80ms-tied claims were randomized.
What's next
In rough order of expected return on effort. Without one of these, Apex sits at the practical ceiling for hand-engineered Gabo.
| Idea | Est. gain | Cost |
|---|---|---|
| Autotune round 4 with finer perturbation around Apex's params | 1 to 3 percentage points | 1 to 2h sim |
| Time-budgeted MCTS with UCB1 selection + Apex playouts | 5 to 10% over Searcher | 1 to 2 days dev |
| Self-play neural net (policy + value) | Unknown ceiling | 1 to 2 weeks plus GPU |
| Game knob: lower call threshold to 3 | Compresses AI gap, makes Gabo timing harder | 10 min |
Reproduce locally
```bash
# 1v1 round-robin (full ladder)
npx tsx src/lib/sim/__1v1__.ts 1000

# 4-player random-seat tournament
npx tsx src/lib/sim/__tournament__.ts 3000

# Top contender head-to-head
npx tsx src/lib/sim/__test1v9__.ts 800

# Searcher vs the ladder (slow)
npx tsx src/lib/sim/__mcts_test__.ts 50

# Random-search autotune
npx tsx src/lib/sim/__autotune__.ts 24 200

# Hill-climb autotune around the current best
npx tsx src/lib/sim/__autotune2__.ts 30 350
```
Engine: src/lib/engine/. Bots: src/lib/ai/. Simulator + Elo: src/lib/sim/. UI: src/components/solo/.
Source-code mapping (for repo explorers)
Internal source files were named in development order before the display names landed. Use this table to map a bot to its source if you're reading the code.
| Display | Source file / class |
|---|---|
| Belief | Belief |
| Belief+ | Belief+ |
| Duel | Belief-1v1 |
| Echo | Belief-1v2 |
| Forge | Belief-1v4 |
| Hammer | Belief-1v6 |
| Apex | Belief-1v7 |
| Tempest | Belief-2v1 |
| Vortex | Belief-1v7+EG |
| Cyclone | Belief-2v2+EG |
| Searcher | MCTS-1v1 |