case study · 2026 · multi-agent llm

Polysignal-BTC5:
Can MAD (Multi-Agent Debate) beat the crowd?

Three AI agents (a momentum analyst, a contrarian, and a judge) debate Bitcoin's next 5-minute price move on Polymarket. The dashboard below is a snapshot of the live system's actual performance across multiple sessions, 326 trades in total.

Python multi-agent LLMs prediction markets forecasting SQLite Streamlit


Portfolio value

$114,719.68

+$14,719.68 (+14.72%)

Total trades

326

1 pending

Win rate

54.9%

179W / 147L

Predictions scored

326

Multi-session

Directional accuracy

54.9%

▲ vs 50% baseline

Avg Brier score

0.2490

↓ lower is better

Portfolio performance

2026-04-24 → 2026-04-28 · 326 trades

[Chart: portfolio balance over time, with each trade marked as a win or a loss]

Confidence calibration

Confidence bucket	n	Actual accuracy	Average confidence
50–55%	308	55.2%	52.9%
55–60%	17	52.9%	55.0%
60–65%	1	0.0% (no wins)	61.3%
65%+	0	(no calls)	(no calls)

UP vs DOWN calls

Call	Accuracy	Correct	Incorrect
UP	51.8%	115	107
DOWN	61.5%	64	40

DOWN calls outperform UP — and they're rarer, suggesting the model picks its bearish moments selectively.

Per-model accuracy

Model	Votes	Brier	Accuracy
xAI (Grok)	322	0.2461	55.9%
Claude (Judge)	322	0.2488	54.7%
OpenAI (GPT-4o-mini)	322	0.2551	47.5%

Momentum (Grok) beats the contrarian (GPT-4o-mini); judge sits in the middle. Vertical line = 50% baseline.

Recent trades

last 10 of 326

Time (UTC)	Direction	Odds	Bet	P&L	Balance after
2026-04-28 15:46	UP	50.5%	$500	-$500.00	$114,719.68
2026-04-28 14:25	UP	51.5%	$500	-$500.00	$115,219.68
2026-04-28 14:20	UP	50.5%	$500	-$500.00	$115,719.68
2026-04-28 14:15	UP	50.5%	$500	-$500.00	$116,219.68
2026-04-28 14:10	UP	50.5%	$500	+$490.10	$116,719.68
2026-04-28 14:05	UP	50.5%	$500	-$500.00	$116,229.58
2026-04-28 14:00	DOWN	49.5%	$500	-$500.00	$116,729.58
2026-04-28 13:55	UP	50.5%	$500	-$500.00	$117,229.58
2026-04-28 13:50	UP	49.5%	$500	+$510.10	$117,729.58
2026-04-28 13:45	DOWN	49.5%	$500	+$510.10	$117,219.48

This is a static snapshot of the live Streamlit dashboard (dashboard.py), rendered from polysignal_btc.db as of 2026-04-28 15:46 UTC. The live version auto-refreshes every 30 seconds during a trading session.

how it works · interpretation

Why 5-minute Bitcoin markets?

Polymarket's BTC up/down 5-minute markets resolve via a Chainlink oracle — a clean, objective binary outcome with no ambiguity. That gives me a research environment with 288 markets per day, dense feedback, and a strong baseline already encoded in the crowd's odds.

"A large group of diverse individuals will come up with better and more robust forecasts and make more intelligent decisions than even the experts." — James Surowiecki, The Wisdom of Crowds (p.41)

If markets are fully efficient, no AI should beat the crowd. The question I wanted to answer: can a multi-agent LLM debate system extract directional signal beyond what the crowd already knows?

System architecture

An orchestrator triggers four steps every five minutes, synchronized to the Polymarket market clock. Predictions and evaluations are persisted to SQLite — which is exactly what the dashboard above reads from.
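As a rough illustration of what "synchronized to the market clock" could look like, here is a minimal sketch that computes the next five-minute boundary and how long to sleep until it. The function names are hypothetical, and the assumption that markets open on UTC multiples of five minutes is inferred from the timestamps in the recent trades table rather than from the project code.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)  # assumed market length


def next_window_start(now: datetime) -> datetime:
    """Return the first 5-minute UTC boundary strictly after `now`."""
    now = now.astimezone(timezone.utc)
    floored = now.replace(minute=now.minute - now.minute % 5, second=0, microsecond=0)
    return floored + WINDOW


def seconds_until_next_window(now: datetime) -> float:
    """Seconds an orchestrator would sleep before starting the next cycle."""
    return (next_window_start(now) - now).total_seconds()


example = datetime(2026, 4, 28, 13, 47, 30, tzinfo=timezone.utc)
print(next_window_start(example))          # 2026-04-28 13:50:00+00:00
print(seconds_until_next_window(example))  # 150.0
print(timedelta(days=1) // WINDOW)         # 288 windows per day
```

Sleeping until the boundary, rather than sleeping a fixed 300 seconds after each cycle, keeps the 8 to 14 seconds of debate time from accumulating as drift.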

Data sources

Source	Signal	Latency
Kraken WebSocket	live BTC price	~100ms
Kraken REST OHLCV	RSI(14), volatility, 1/5/15-min momentum	~1s
Polymarket Gamma API	crowd UP/DOWN odds	~2s
Alternative.me	Fear & Greed Index (0–100)	~1s
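The derived signals in the table can be computed from a list of closing prices. The sketch below assumes Wilder's smoothing for RSI(14) and a simple fractional change for momentum; the project may use a different RSI variant or bar size, and the function names are illustrative only.

```python
from typing import Sequence


def rsi(closes: Sequence[float], period: int = 14) -> float:
    """Relative Strength Index with Wilder smoothing; needs period + 1 closes."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(d, 0.0) for d in deltas]
    losses = [max(-d, 0.0) for d in deltas]
    # Seed with a simple average over the first `period` changes.
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    # Wilder's recursive smoothing over any remaining changes.
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
    if avg_loss == 0:
        # Convention chosen here: flat series -> 50, pure uptrend -> 100.
        return 100.0 if avg_gain > 0 else 50.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


def momentum(closes: Sequence[float], lookback: int) -> float:
    """Fractional price change over the last `lookback` bars."""
    return closes[-1] / closes[-1 - lookback] - 1.0


rising = [100.0 + i for i in range(20)]
print(rsi(rising))          # 100.0
print(momentum(rising, 5))  # change over the last 5 bars
```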

Multi-Agent Debate

Each prediction runs through three rounds in roughly 8–14 seconds. Round 1 produces independent positions for diversity. Round 2 forces conflict via cross-examination. Round 3 aggregates everything into a single verdict.

round 1 · diversity

Independent

Grok → momentum position
GPT-4o-mini → contrarian position

round 2 · conflict

Cross-examination

Grok reads GPT → rebuttal
GPT reads Grok → rebuttal

round 3 · aggregation

Judgment

Claude Sonnet reads all 4 texts (2 positions, 2 rebuttals)
→ direction + confidence + reasoning
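The three rounds can be sketched as plain function composition. Everything named here (`Agent`, `Verdict`, `run_debate`, the prompt strings, and the stub agents) is hypothetical and stands in for the real model calls, which are not shown in this write-up.

```python
from dataclasses import dataclass
from typing import Callable, List

Agent = Callable[[str], str]  # hypothetical: takes a prompt, returns text


@dataclass
class Verdict:
    direction: str     # "UP" or "DOWN"
    confidence: float  # probability assigned to `direction`
    reasoning: str


def run_debate(
    market_context: str,
    momentum: Agent,
    contrarian: Agent,
    judge: Callable[[List[str]], Verdict],
) -> Verdict:
    # Round 1: independent positions; neither agent sees the other.
    pos_m = momentum(f"Argue the momentum case.\n{market_context}")
    pos_c = contrarian(f"Argue the contrarian case.\n{market_context}")
    # Round 2: each agent reads the other's position and rebuts it.
    reb_m = momentum(f"Rebut this position:\n{pos_c}")
    reb_c = contrarian(f"Rebut this position:\n{pos_m}")
    # Round 3: the judge reads all four texts and returns one verdict.
    return judge([pos_m, pos_c, reb_m, reb_c])


# Stub agents so the sketch runs without any API keys.
def stub_momentum(prompt: str) -> str:
    return "UP: the trend continues"


def stub_contrarian(prompt: str) -> str:
    return "DOWN: the move is overextended"


def stub_judge(texts: List[str]) -> Verdict:
    return Verdict("UP", 0.53, f"weighed {len(texts)} arguments")


verdict = run_debate("BTC 5-min market", stub_momentum, stub_contrarian, stub_judge)
print(verdict.direction, verdict.confidence, verdict.reasoning)
```

Passing the agents in as callables keeps the round structure testable on its own and makes it easy to swap a different model into any role.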

Reading the dashboard

Two metrics matter for this kind of system. Both appear in the KPI strip up top.

Directional accuracy: did the model predict the correct movement? It treats 51% and 99% predictions equally, so it answers "would this signal make money?" At 54.9% over 326 trades, the system sits 4.9 points above the 50% coin-flip baseline, although at this sample size the 95% confidence interval (roughly 49.5% to 60.2%) still includes 50%.
Brier score: how confident and calibrated was the prediction? It is the mean squared error between the forecast probability and the 0/1 outcome, so it heavily penalizes overconfident errors. A constant 50% forecast scores 0.25; the system clocks 0.2490, just under that baseline, which lines up with the calibration panel showing the model rarely strays past 55% confidence.
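Both metrics are short to compute. The sketch below uses the standard definitions on a toy set of four forecasts; the data is made up for illustration, and representing each forecast as a probability of UP is an assumption about how the predictions are stored.

```python
from typing import Sequence


def directional_accuracy(p_up: Sequence[float], went_up: Sequence[bool]) -> float:
    """Share of forecasts whose favored side (p_up >= 0.5 means UP) matched the outcome."""
    hits = sum((p >= 0.5) == y for p, y in zip(p_up, went_up))
    return hits / len(p_up)


def brier_score(p_up: Sequence[float], went_up: Sequence[bool]) -> float:
    """Mean squared error between forecast probability and the 0/1 outcome."""
    return sum((p - float(y)) ** 2 for p, y in zip(p_up, went_up)) / len(p_up)


# Toy data, not taken from the dashboard.
p_up = [0.53, 0.48, 0.55, 0.51]
went_up = [True, False, False, True]

print(directional_accuracy(p_up, went_up))  # 0.75
print(brier_score(p_up, went_up))           # about 0.2485
print(brier_score([0.5] * 4, went_up))      # 0.25, whatever the outcomes are
```

Because forecasts this close to 50% can only move the Brier score a little in either direction, a value near 0.25 says more about how cautious the model is than about how accurate it is.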

The portfolio curve shows the simulated P&L trajectory at a flat $500/trade sizing. Each green dot is a winning trade; each gold dot is a losing one. The drift is upward but lumpy, a reminder that 54.9% accuracy still means losing about 45% of trades.
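The per-trade P&L in the recent trades table is consistent with buying shares at the quoted odds that pay $1 each on a win: a $500 win at 50.5% returns +$490.10, and at 49.5% it returns +$510.10. The sketch below encodes that inferred rule. It is an assumption reverse-engineered from the table, with no fees or slippage, not the project's confirmed settlement logic.

```python
def settle(bet: float, price: float, won: bool) -> float:
    """P&L of a binary contract bought at `price` (between 0 and 1) for `bet` dollars."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    if not won:
        return -bet  # losing shares expire worthless
    # A stake of `bet` buys bet / price shares, each paying $1 on a win.
    return round(bet * (1.0 - price) / price, 2)


print(settle(500, 0.505, True))   # 490.1
print(settle(500, 0.495, True))   # 510.1
print(settle(500, 0.505, False))  # -500
```

Under this rule the break-even hit rate equals the price paid, so with odds hovering around 50% the flat-stake P&L tracks directional accuracy closely.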

What the numbers say

Three things stand out from the dashboard:

Momentum dominates short-term signal. Grok at 55.9% is the strongest single agent. The contrarian (GPT-4o-mini) sits at 47.5%, slightly worse than a coin flip, which suggests fading momentum at 5-minute horizons is a losing strategy.
The judge adds value over the contrarian, not Grok. Claude's 54.7% sits between the two — sometimes deferring to Grok, sometimes blending. When the judge sided with Grok, accuracy jumped above 60%; when it sided with GPT, it cratered.
Calibration is good but conservative. 308 of 326 calls land in the 50–55% confidence bucket, where actual accuracy was 55.2% against an average stated confidence of 52.9%: close, and if anything slightly underconfident. The model rarely commits past 55% because the underlying signal genuinely is weak.
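The calibration panel boils down to grouping calls by stated confidence and comparing each group's hit rate with its average confidence. Here is a minimal sketch of that grouping, using the same bucket edges as the dashboard; the function name and the toy inputs are illustrative, not taken from dashboard.py.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def calibration_table(
    confidences: Sequence[float],
    wins: Sequence[bool],
    edges: Sequence[float] = (0.50, 0.55, 0.60, 0.65, 1.0001),
) -> Dict[float, Tuple[int, float, float]]:
    """Map each bucket's lower edge to (n, actual accuracy, average confidence)."""
    buckets = defaultdict(list)
    for conf, win in zip(confidences, wins):
        # Confidences outside [0.50, 1.0] match no bucket and are skipped.
        for lo, hi in zip(edges, edges[1:]):
            if lo <= conf < hi:
                buckets[lo].append((conf, win))
                break
    table = {}
    for lo, rows in sorted(buckets.items()):
        n = len(rows)
        actual = sum(win for _, win in rows) / n
        avg_conf = sum(conf for conf, _ in rows) / n
        table[lo] = (n, actual, avg_conf)
    return table


# Toy data, not taken from the dashboard.
toy = calibration_table([0.52, 0.53, 0.54, 0.57, 0.58], [True, False, True, True, False])
for lo, (n, actual, avg_conf) in toy.items():
    print(f"{lo:.2f}+  n={n}  actual={actual:.3f}  confidence={avg_conf:.3f}")
```

A bucket is well calibrated when the two numbers match; with only 17 calls and 1 call in the upper buckets, those rows carry very little information yet.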

Limitations & future work

I want to be honest about what 326 trades over a calm ~5-day window can and can't tell us:

Statistical power — confidence intervals are still wider than I'd like; a 2,000+ trade run is the goal.
No live capital — simulated P&L only; the actual market is restricted to US users.
Weak contrarian — GPT-4o-mini at 47.5% may be too small for the role.
Single regime — no evaluation across high/low volatility regimes yet.
Lite models only — reasoning-tier models (Opus, GPT-5) would likely shift the picture.
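To put a number on the statistical power point, the sketch below computes a Wilson score interval for the observed 179 wins in 326 trades. The Wilson interval is a standard choice for binomial proportions; treating the trades as independent is an assumption that consecutive 5-minute markets may not fully satisfy.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


lo, hi = wilson_interval(179, 326)
print(f"{lo:.3f} to {hi:.3f}")  # 0.495 to 0.602

# Same hit rate at the 2,000-trade target, for comparison.
lo_2k, hi_2k = wilson_interval(1098, 2000)
print(f"{lo_2k:.3f} to {hi_2k:.3f}")
```

The current interval still contains 50%, which is why a longer run matters: if the hit rate held near 54.9% at 2,000 trades, the interval would narrow enough to exclude it.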

What's next

Longer evaluation — extending toward 7+ days, 2,000+ trades.
Stronger contrarian — swap in Claude Opus or GPT-5 for the contrarian role.
Adaptive sizing — Kelly-style sizing instead of flat $500/trade.
Multi-horizon — test 15-min and 30-min markets.
Live deployment — real capital to validate the simulated results.
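For the adaptive sizing item, the Kelly criterion for a contract that costs `price` and pays $1 simplifies to (p - price) / (1 - price), where p is the estimated win probability. The sketch below is one way this could replace the flat $500 stake; it assumes the confidence estimate is accurate and ignores fees, and in practice a fraction of the Kelly stake is often used to limit drawdowns.

```python
def kelly_fraction(p_win: float, price: float) -> float:
    """Kelly share of bankroll for a $1-payout binary contract bought at `price`."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    edge = p_win - price
    # No positive edge means no bet.
    return max(0.0, edge / (1.0 - price))


def stake(bankroll: float, p_win: float, price: float, scale: float = 0.5) -> float:
    """Dollar stake at a chosen fraction of Kelly (scale=0.5 is half Kelly)."""
    return round(bankroll * scale * kelly_fraction(p_win, price), 2)


print(kelly_fraction(0.55, 0.505))  # about 0.0909
print(kelly_fraction(0.50, 0.505))  # 0.0, no edge at this price
print(stake(100_000, 0.55, 0.505))  # half-Kelly stake in dollars
```

With confidence rarely above 55% and prices near 50%, stakes computed this way would swing widely with small changes in the confidence estimate, so calibration quality matters more here than under flat sizing.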

The takeaway

In this sample, multi-agent debate shows a modest edge over the 50% baseline: 54.9% directional accuracy and a +14.72% simulated return over multiple sessions. Structured disagreement followed by aggregation appears to extract incremental predictive signal, and crucially the system stays well-calibrated rather than overconfident. Preliminary, but the direction is encouraging.

The deck

The full presentation walks through motivation, architecture, the MAD pipeline, and results.

Polysignal-BTC5.pdf · 15 slides
