Building strong AIs for Gabo

An 18-bot AI ladder, two tournament formats, two rounds of autotuning, MCTS, and the methods and lessons behind chasing the strongest practical bot.

Champion

Cyclone

tracker + retune + finisher

Ladder span

2,050 Elo

Cyclone over Random (1000)

Bots ranked

head-to-head, thousands of games

Cyclone vs Apex

71%

prev. champion beaten by retune

TL;DR

Gabo is a 1 to 6 player memory-and-matching card game. We built 19 distinct bots ranging from Random to Cyclone, the current champion. Cyclone stacks three independent gains: stateful opponent tracking (per-slot value adjustments inferred from how the opponent's hand changes), autotuned thresholds re-fitted to the current ruleset, and an endgame give-to-win finisher that empties the hand for an instant win.

Cyclone beats the previous champion Apex 71% of the time and Vortex (Apex + finisher) 70% of the time. The headline lesson: when the game rules changed (seeded discard, permanent holes, empty-hand win), the old autotuned parameters went stale, and re-tuning recovered ~12 percentage points. The opponent tracker remained the single most valuable component throughout, worth more than any amount of parameter tuning alone.

The roster

Each bot is one strategy. The Belief family is shown in development order. Personality bots (Beginner, MatchHunter, etc.) sit lower on the ladder and exist to fill out the Elo curve.

Bot	Idea
Belief	Bayesian rank-residual baseline
Belief+	Tighter swap and Gabo thresholds, aggressive abilities
Duel	First bot tuned for 1-vs-1 play
Echo	Duel + stateful opponent action tracking
Forge	Autotune round 1 (random search, 24 candidates)
Hammer	Autotune round 2 (hill climb, 30 candidates)
Apex	Hammer params + opponent tracker.
Tempest	Re-tuned params for the current ruleset (no tracker).
Vortex	Apex + empty-hand give-to-win finisher.
Cyclone	Tracker + re-tuned params + give-to-win. Current champion.
Searcher	Determinized search using Belief playouts. About 100x slower per move.

The empirical ladder

18-bot 1v1 round-robin. 800 games per pair, ~122k games total. Elo fit via Bradley-Terry maximum-likelihood with Laplace smoothing; Random anchored to 1000.

Rank	Bot	Elo
1	Cyclone	3,050
2	Vortex	2,880
3	Apex	2,822
4	Tempest	2,700
5	Hammer	2,560
6	Forge	2,511
7	Duel	2,398
8	Echo	2,397
9	Belief+	2,225
10	Belief	2,062
11	Heuristic	2,017
12	PeekHoarder	1,905
13	Beginner	1,850
14	NeverGabo	1,767
15	GreedyGabo	1,641
16	DiscardOnly	1,429
17	AlwaysDraw	1,400
18	MatchHunter	1,054
19	Random	1,000
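
As a rough illustration of the Elo fit described above the ladder, here is a minimal sketch of a Bradley-Terry maximum-likelihood fit with Laplace smoothing. The function name `fitElo`, the `pseudo` smoothing constant, and the fixed iteration count are assumptions for illustration, not the simulator's actual code. The update is the standard minorization-maximization iteration.

```typescript
// wins[i][j] = number of games bot i won against bot j.
// `pseudo` is an assumed smoothing constant (pseudo-wins added to both sides
// of every pairing) so that 100/0 matchups keep every strength finite.
function fitElo(
  wins: number[][],
  anchorIndex: number,
  anchorElo: number,
  pseudo = 0.5,
  iters = 200,
): number[] {
  const n = wins.length;
  const w = wins.map((row, i) => row.map((v, j) => (i === j ? 0 : v + pseudo)));
  let p: number[] = new Array<number>(n).fill(1);
  for (let it = 0; it < iters; it++) {
    // MM update: strength = total wins / sum over opponents of games / (p_i + p_j).
    const next = p.map((_, i) => {
      let totalWins = 0;
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        totalWins += w[i][j];
        denom += (w[i][j] + w[j][i]) / (p[i] + p[j]);
      }
      return totalWins / denom;
    });
    // Rescale by the geometric mean; only strength ratios matter.
    const logMean = next.reduce((s, v) => s + Math.log(v), 0) / n;
    p = next.map((v) => v / Math.exp(logMean));
  }
  // Convert strength ratios to the Elo scale and pin the anchor bot.
  return p.map((v) => anchorElo + 400 * Math.log10(v / p[anchorIndex]));
}
```

With `pseudo = 0`, a bot that never wins gets strength zero and the fit breaks down, which is the 100/0 failure mode the smoothing is meant to avoid.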

Win-rate heatmap

Each cell is the percentage of head-to-head games the row bot won against the column bot. Green = win, red = loss, slate = even. Bottom-tier bots (Random through MatchHunter) omitted from the matrix since they all lose 0-100 to the top tier.

	Apex	Edge	Spire	Hammer	Forge	Echo	Duel	Belief+	Belief	Heuristic	PeekHoarder	Beginner
Apex	·	57	70	82	87	97	95	99	100	100	100	100
Edge	43	·	65	77	83	95	94	99	100	100	100	100
Spire	30	35	·	88	93	92	98	100	100	100	100	100
Hammer	18	23	12	·	58	79	81	98	100	100	100	100
Forge	13	17	7	42	·	64	76	97	99	100	100	100
Echo	3	5	8	21	36	·	49	86	98	99	100	100
Duel	5	6	2	19	24	51	·	92	99	100	100	100
Belief+	1	1	0	2	3	14	8	·	95	100	99	100
Belief	0	0	0	0	1	2	1	5	·	67	96	100
Heuristic	0	0	0	0	0	1	0	0	33	·	96	100
PeekHoarder	0	0	0	0	0	0	0	1	4	4	·	61
Beginner	0	0	0	0	0	0	0	0	0	0	39	·

Head-to-head highlights

Row-on-left wins this percentage. The gap between the top three and the rest is brutal, 100/0 in most matchups.

Matchup	Left win %	Right win %	Note
Cyclone vs Apex	71%	29%	re-tune + finisher beats old champion
Cyclone vs Vortex	70%	30%	
Vortex vs Apex	53%	47%	give-to-win finisher edge
Tempest vs Hammer	75%	25%	current-rules retune beats stale
Apex vs Tempest	74%	26%	tracker beats better params
Apex vs Hammer	89%	11%	
Forge vs Duel	76%	24%	autotune beats hand-tuning
Belief vs Heuristic	67%	33%	Bayesian beats constant-estimate
PeekHoarder vs NeverGabo	74%	26%	info beats stubbornness

What the autotuner discovered

The same Belief shell, with different parameter values fitted by random search and hill climbing. The direction of travel matters: swap-delta keeps climbing (less swapping, not more), Gabo cutoffs tighten (call when more confident), and give-card threshold drops (give almost anything after a match).

Parameter	Duel	Forge	Hammer	Edge
swap delta	0.6	1.20	1.57	2.41
Gabo P critical	0.18	0.21	0.13	0.06
Gabo P risky	0.08	0.10	0.13	0.05
match speculation	0.22	0.22	0.27	0.19
swap hands delta	4.0	4.9	3.5	2.6
give-card threshold	5	3.5	2	1.7

Searcher: the search-based bot

Determinized search. For each candidate top-of-turn action, sample 50 hidden-state determinizations from belief, apply the action, play out both sides using Belief under perfect information, pick the action with the best mean round-score. About 100 times slower per move than the heuristic bots.

Opponent	Searcher win %
Random	100%
Heuristic	100%
Belief+	97%
Duel	56%
Echo	56%
Forge	30%

Searcher lands between Echo and Forge in strength. It wins only 30% of its games against Forge and does worse against Hammer and Apex. The autotuned heuristic beats explicit search at a fraction of the runtime cost.
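
The determinized search loop described in this section can be sketched generically. The `SearchModel` interface and the function name `chooseAction` are hypothetical, not the repo's API, and the sketch assumes `playout` returns a value where higher is better for the searching player.

```typescript
// Hypothetical abstraction over the game; none of these names come from the repo.
interface SearchModel<S, A> {
  legalActions(view: S): A[];
  // Fill in hidden cards by sampling from the viewer's belief.
  sampleDeterminization(view: S, rng: () => number): S;
  apply(state: S, action: A): S;
  // Play the round out under perfect information; higher is better (assumption).
  playout(state: S, rng: () => number): number;
}

function chooseAction<S, A>(
  model: SearchModel<S, A>,
  view: S,
  samples: number,
  rng: () => number = Math.random,
): A {
  const actions = model.legalActions(view);
  if (actions.length === 0) throw new Error("no legal actions");
  let best = actions[0];
  let bestMean = -Infinity;
  for (const action of actions) {
    let total = 0;
    for (let i = 0; i < samples; i++) {
      // Each sample is one guess at the hidden state, consistent with belief.
      const world = model.sampleDeterminization(view, rng);
      total += model.playout(model.apply(world, action), rng);
    }
    const mean = total / samples;
    if (mean > bestMean) {
      bestMean = mean;
      best = action;
    }
  }
  return best;
}
```

The cost is actions × samples full playouts per move, which is consistent with the roughly 100x slowdown reported for 50 determinizations.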

4-player tournament

3,000 games with 4 bots randomly chosen from the 16-bot roster each game. The 1v1-tuned Apex also dominates 4-player play: it wins 66.7% of the games it plays in and has an average finishing rank of 1.42 out of 4.

Bot	Wins	Win %	Avg rank
Apex	498	66.7%	1.42
Forge	431	59.1%	1.57
Hammer	425	58.9%	1.56
Echo	397	51.0%	1.64
Duel	396	52.0%	1.65
Belief+	289	38.2%	1.91
Belief	188	25.8%	2.16
Heuristic	143	18.1%	2.33
PeekHoarder	71	9.5%	2.60
Beginner	64	8.1%	2.66
NeverGabo	41	5.3%	2.90
GreedyGabo	27	3.7%	3.03
DiscardOnly	15	2.0%	3.38
AlwaysDraw	14	2.0%	3.45
Random	0	0.0%	3.84
MatchHunter	1	0.1%	3.92

Development timeline

1. Pure engine first
Deterministic state machine. 26-action discriminated union, 22-event union, seedable RNG, per-player view filtering. No UI, no AI. A type-checked smoke test ran a full round before any bot existed.
2. Random + Heuristic baselines
Random validates that the engine doesn't deadlock. Heuristic uses a constant-7 estimate for any unseen card to drive swap/Gabo decisions.
3. Belief tracker
Bayesian rank-residual. For an unknown card, P(rank=r) = unknownCount[r] / totalUnknown. Strictly better than the constant estimate.
4. Belief+ general tuning
Tighter swap threshold, aggressive ability use, a 'speculative Gabo' branch that calls when expected total clearly beats opponents.
5. Duel (1v1 specialization)
Two-player tuning. Bayesian Gabo using a normal approximation of P(opp_sum < my_sum). Always activates info abilities. Tightest swap and match thresholds.
6. Echo (opponent action tracking)
Stateful bot. Diffs the opponent's hand turn-over-turn. When a slot's card-id changes and the new card isn't in our seen set, the opponent kept a deliberately chosen card. Mark the slot with a lower expected value.
7. Personality bots
Six single-rule bots (AlwaysDraw, DiscardOnly, MatchHunter, GreedyGabo, NeverGabo, PeekHoarder) populate the mid-low ladder so the Elo curve is smooth, not a top-tier oasis.
8. Knowledge tracker fix
The latent bug that pinned the ceiling. Public-card exposures (everything that touches the discard pile, plus successful matches and given cards) now propagate to every player's seenCardIds, so what each bot can remember matches what the rules let it see. Top-tier separation widened by more than 20 points.
9. Forge (autotune round 1)
Random search over Belief's parameter box. 24 candidates, 200 games each. Winner improved gauntlet winrate by 11.5pp. Big insight: aggressive swapping was a net negative because the displaced card lands publicly on the discard pile, leaking information faster than the swap improves the hand.
10. Searcher (MCTS)
For each top-of-turn action, sample 50 determinizations of the opponent's hand from belief, apply the candidate, then play out using Belief perfect-info rollouts. Beats Duel about 57% of the time but at 100x the per-move cost. Lands between Echo and Forge.
11. Hammer (autotune round 2)
Hill-climbed around Forge for 30 candidates × 350 games. Found 93% gauntlet winrate (+6.2pp). Discovered direction: even more conservative swapping (1.57 vs 1.2) and more aggressive Gabo when confident (P-critical 0.13 vs 0.21).
12. Apex (autotune + tracker)
Hammer's params + Echo's per-slot adjustments. The two ideas stack into a new champion (later overtaken by Cyclone). 95% vs Duel, 82% vs Hammer, 100% vs everything below the top tier.
13. Edge + Spire (round 3)
Autotune round 3 against a tougher gauntlet (including Hammer itself). Edge has very conservative swap (2.41) and very aggressive Gabo. Spire adds Echo on top. Both end up slightly weaker than Apex head-to-head; the round-3 params conflict with the tracker's nudges.

Methods

Autotuning the strongest bot

We have a parameterized Belief variant where every threshold (swap delta, speculative-match probability, Gabo confidence cutoffs, give-card threshold) is a constructor argument. A random-search script samples points in a 6-dimensional box, evaluates each candidate against a 5-bot gauntlet (200 to 350 games per candidate), and reports the best. Forge = round 1, Hammer = round 2, Apex = Hammer + Echo.

best = { rate: 0 }
for i in 1..numCandidates:
  candidate = sampleRandom(paramBox)
  winRate = playGauntlet(candidate, gauntlet, gamesPerCandidate)  # 200 to 350 games
  if winRate > best.rate: best = { params: candidate, rate: winRate }
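
For readers who want something executable, here is a minimal sketch of the random-search loop. The names `ParamBox`, `sampleRandom`, and `randomSearch` are illustrative, and `evaluate` stands in for playing a candidate against the gauntlet and returning its win rate; none of this is the repo's actual autotune script.

```typescript
type Params = Record<string, number>;
type ParamBox = Record<string, [number, number]>; // [min, max] per parameter

// Draw one point uniformly from the box.
function sampleRandom(box: ParamBox, rng: () => number): Params {
  const params: Params = {};
  for (const name of Object.keys(box)) {
    const [lo, hi] = box[name];
    params[name] = lo + rng() * (hi - lo);
  }
  return params;
}

// Evaluate `numCandidates` random points and keep the best one.
// `evaluate` is a placeholder for "play the gauntlet, return the win rate".
function randomSearch(
  box: ParamBox,
  numCandidates: number,
  evaluate: (params: Params) => number,
  rng: () => number = Math.random,
): { params: Params; rate: number } {
  let best: { params: Params; rate: number } | undefined;
  for (let i = 0; i < numCandidates; i++) {
    const params = sampleRandom(box, rng);
    const rate = evaluate(params);
    if (best === undefined || rate > best.rate) best = { params, rate };
  }
  if (best === undefined) throw new Error("need at least one candidate");
  return best;
}
```

Passing a seedable `rng` keeps a tuning run reproducible, in the same spirit as the engine's seedable RNG.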

Per-player knowledge

Each seat carries a Set of card IDs whose identity that player has observed. Events update it: own peeks, ability reveals visible to that player, public discards, public match results. AIs that read from this Set play with truthful information.
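
A small sketch of how events might update the per-seat Sets. The event shapes below are hypothetical stand-ins for illustration, not the engine's actual 22-event union.

```typescript
// Hypothetical event shapes; the real engine's union is larger.
type GameEvent =
  | { kind: "peek"; viewer: number; cardId: number } // private: only `viewer` learns the card
  | { kind: "discard"; cardId: number } // public: everyone sees the discard pile
  | { kind: "matchResult"; cardId: number }; // public: match outcomes are visible to all

// seenCardIds[seat] is the Set of card IDs that seat has observed.
function applyEvent(seenCardIds: Set<number>[], event: GameEvent): void {
  switch (event.kind) {
    case "peek":
      seenCardIds[event.viewer].add(event.cardId);
      break;
    case "discard":
    case "matchResult":
      for (const seen of seenCardIds) seen.add(event.cardId);
      break;
  }
}
```

Keeping private and public events in separate branches makes the failure mode described under the knowledge tracker fix easy to spot: a public event that updates only one seat's Set.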

Belief: Bayesian rank-residual

Enumerate the 54-card deck. Subtract every card on the discard pile (public) plus every card the viewer has personally seen. The remainder is the unknown pool. For any unknown card with no other info, P(rank=r) = unknownCount[r] / totalUnknown. Expected value follows by linearity.

for each unknown card:
  P(rank = r) = unknownCount[r] / totalUnknown
E[value] = sum_r P(r) * value(r)
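
The same calculation as a runnable sketch. Cards are represented only by their rank string, and the point value of each rank is passed in as a function because the scoring table is not given here; `expectedUnknownValue` is an illustrative name.

```typescript
// Expected point value of a card about which the viewer knows nothing.
// `value` maps a rank to its point value (supplied by the caller).
function expectedUnknownValue(
  deck: string[],
  discardPile: string[],
  seenByViewer: string[],
  value: (rank: string) => number,
): number {
  // Start from the full deck, then subtract every card whose identity is known.
  const unknownCount = new Map<string, number>();
  for (const rank of deck) unknownCount.set(rank, (unknownCount.get(rank) ?? 0) + 1);
  for (const rank of [...discardPile, ...seenByViewer]) {
    const left = unknownCount.get(rank) ?? 0;
    if (left > 0) unknownCount.set(rank, left - 1);
  }
  let totalUnknown = 0;
  unknownCount.forEach((count) => {
    totalUnknown += count;
  });
  if (totalUnknown === 0) throw new Error("no unknown cards left");
  // E[value] = sum over ranks of P(rank) * value(rank).
  let expected = 0;
  unknownCount.forEach((count, rank) => {
    expected += (count / totalUnknown) * value(rank);
  });
  return expected;
}
```

For example, with a toy deck of two aces (worth 1) and two kings (worth 10) and one king on the discard pile, the unknown pool is two aces and one king, so the expected value is (2 × 1 + 1 × 10) / 3 = 4.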

Bayesian Gabo timing

Replace heuristic Gabo thresholds with explicit probability estimation. Estimate (mean, variance) of my hand sum and opp's hand sum from known plus belief. Compute P(opp lower) using a normal approximation. Call Gabo when this probability is comfortably small.

diff = oppSum - mySum ~ Normal(mu, sigma^2)
P(opp lower) = P(diff < 0) = Phi(-mu / sigma)
P(I'm lower) = 1 - P(opp lower) = Phi(mu / sigma)
Call Gabo when P(I'm lower) > 0.87  (Apex threshold, i.e. P(opp lower) < 0.13)
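
A minimal sketch of the Gabo decision under the normal approximation. It assumes the two hand sums are independent, so their variances add. The names `normalCdf`, `pOppLower`, and `shouldCallGabo` are illustrative, and the CDF uses the Abramowitz-Stegun polynomial approximation of erf because JavaScript has no built-in one.

```typescript
interface SumEstimate {
  mean: number;
  variance: number;
}

// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation.
function normalCdf(z: number): number {
  const t = 1 / (1 + (0.3275911 * Math.abs(z)) / Math.SQRT2);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-(z * z) / 2);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// P(oppSum < mySum), with diff = oppSum - mySum ~ Normal(mu, sigma^2).
// Assumption: the two sums are independent, so variances add.
function pOppLower(my: SumEstimate, opp: SumEstimate): number {
  const mu = opp.mean - my.mean;
  const sigma = Math.sqrt(my.variance + opp.variance);
  if (sigma === 0) return mu < 0 ? 1 : 0; // both hands fully known
  return normalCdf(-mu / sigma);
}

// Call Gabo when the chance that the opponent is lower falls under the cutoff.
function shouldCallGabo(my: SumEstimate, opp: SumEstimate, pCritical: number): boolean {
  return pOppLower(my, opp) < pCritical;
}
```

With `pCritical = 0.13` (the Hammer column of the parameter table), this is the same rule as calling when P(I'm lower) exceeds 0.87.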

Match speculation

Known matches are always claimed (reaction time 0.5 to 0.8s). When P(rank = discard top) exceeds the threshold, the bot speculates on an unknown slot (about 1.0 to 1.4s). Hammer uses 0.22, Belief+ uses 0.27, plain Belief uses 0.32. A lower threshold means more attempted matches: more penalty cards on misses, but more board control on hits.

Opponent action tracking (Echo, Apex, Spire)

Stateful: snapshot opp's slot card-ids each turn. If a slot's id changes and the new card isn't in our seenCardIds, the opponent drew from the deck and chose to keep it. A rational opponent only keeps cards better than their worst slot, so the new card has below-average expected value. Mark the slot with an override (2.5 = the conditional mean of cards a strong opp would keep).
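
The snapshot-and-diff idea can be sketched as a small class. `OppTracker`, `observe`, and the handling of emptied slots are assumptions for illustration; only the 2.5 override value and the "changed and unseen" rule come from the description above.

```typescript
// Tracks one opponent's slots across turns and marks deliberately kept cards.
class OppTracker {
  // slot index -> overridden expected value for that slot
  readonly overrides = new Map<number, number>();
  private lastSlots: (number | null)[] = [];
  private readonly keptValue: number;

  constructor(keptValue = 2.5) {
    this.keptValue = keptValue;
  }

  // slotCardIds: the opponent's current card ID per slot (null = empty slot).
  // seenCardIds: the card IDs whose identity we have observed.
  observe(slotCardIds: (number | null)[], seenCardIds: ReadonlySet<number>): void {
    slotCardIds.forEach((id, slot) => {
      const prev = this.lastSlots[slot];
      if (prev === undefined || id === prev) return; // first snapshot or unchanged
      if (id === null || seenCardIds.has(id)) {
        // Slot emptied, or the new card's identity is known: no estimate needed.
        this.overrides.delete(slot);
      } else {
        // Unseen replacement: the opponent drew from the deck and chose to keep it.
        this.overrides.set(slot, this.keptValue);
      }
    });
    this.lastSlots = [...slotCardIds];
  }
}
```

A bot using this would read `overrides.get(slot)` before falling back to the belief-based expected value for that slot.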

What didn't work

The interesting failures. Each one taught us something about the surface.

First MCTS attempt

Heuristic playouts with only 18 rollouts and a broken round-end scoring path (callGabo's value was computed from raw hand sum, missing the +15 miscall penalty / -10 winner bonus). Lost to every bot including Heuristic. Fixed: switched to Belief playouts at 50 rollouts, corrected the round-history baseline. Now Searcher is competitive but slow.

OppTracker constant override (overcooked at 4.0)

Marking deliberately kept slots with an expected value of 4.0 set the estimate too high for cards the opponent chose to keep. Lowered to 2.5 (the conditional mean of cards a rational opp would keep), and Echo now ties Duel.

Spire (autotune r3 + tracker)

Adding the opponent tracker on top of Edge's aggressive parameters didn't compound. Edge's very low Gabo cutoff conflicts with the tracker's nudges (which downgrade opp's apparent total). Apex (Hammer + tracker) remains the strongest.

Lessons

1. Get the boring infrastructure right first. The knowledge tracker bug was invisible until tournaments at scale exposed it.
2. Autotuning beats hand-tuning even when humans wrote the heuristic. Random search over 6 parameters found Forge, which beats hand-tuned Duel 76-24, and the winning direction (less aggressive swapping) was counterintuitive.
3. Two ideas often stack. Hammer + Echo gives Apex. The combined approach beats either alone by a wider margin.
4. Specialization generalizes. The 1v1-tuned bot also won the 4-player tournament. The underlying ideas (Bayesian Gabo, info abilities, tight thresholds) are format-agnostic.
5. Search is not always worth it. Even after the MCTS fixes, Searcher still loses to Forge (30% head-to-head) at 100x the per-move runtime. The autotuned heuristic is the better practical choice.
6. Pure functions make research cheap. A 30,000-game tournament runs in about 130s in TypeScript with no optimization.
7. Elo math breaks on 100/0 matchups. Laplace smoothing scaled to game count keeps the ladder finite.
8. Tie behavior matters in real-time mechanics. Match windows had implicit seat-order bias until 80ms-tied claims were randomized.

What's next

In rough order of expected return on effort. Without one of these, Apex sits at the practical ceiling for hand-engineered Gabo.

Idea	Est. gain	Cost
Autotune round 4 with finer perturbation around Apex's params	1 to 3 percentage points	1 to 2h sim
Time-budgeted MCTS with UCB1 selection + Apex playouts	5 to 10% over Searcher	1 to 2 days dev
Self-play neural net (policy + value)	Unknown ceiling	1 to 2 weeks plus GPU
Game knob: lower call threshold to 3	Compresses AI gap, makes Gabo timing harder	10 min

Reproduce locally

# 1v1 round-robin (full ladder)
npx tsx src/lib/sim/__1v1__.ts 1000

# 4-player random-seat tournament
npx tsx src/lib/sim/__tournament__.ts 3000

# Top contender head-to-head
npx tsx src/lib/sim/__test1v9__.ts 800

# Searcher vs the ladder (slow)
npx tsx src/lib/sim/__mcts_test__.ts 50

# Random-search autotune
npx tsx src/lib/sim/__autotune__.ts 24 200

# Hill-climb autotune around the current best
npx tsx src/lib/sim/__autotune2__.ts 30 350

Engine: src/lib/engine/. Bots: src/lib/ai/. Simulator + Elo: src/lib/sim/. UI: src/components/solo/.

Source-code mapping (for repo explorers)

Internal source files were named in development order before the display names landed. Use this table to map a bot to its source if you're reading the code.

Display	Source file / class
Belief	Belief
Belief+	Belief+
Duel	Belief-1v1
Echo	Belief-1v2
Forge	Belief-1v4
Hammer	Belief-1v6
Apex	Belief-1v7
Tempest	Belief-2v1
Vortex	Belief-1v7+EG
Cyclone	Belief-2v2+EG
Searcher	MCTS-1v1
