case study · 2026 · multi-agent llm

Polysignal-BTC5:
Can MAD (Multi-Agent Debate) beat the crowd?

Three AI agents (a momentum analyst, a contrarian, and a judge) debate Bitcoin's next 5-minute price move on Polymarket. The dashboard below is a snapshot of the live system's actual performance over multiple sessions, totaling 326 trades.

Python · multi-agent LLMs · prediction markets · forecasting · SQLite · Streamlit
₿ PolySignal
BTC 5-min prediction market · multi-agent AI ensemble
data snapshot
as of 2026-04-28 15:46 UTC
  • Portfolio value: $114,719.68 (+$14,719.68, +14.72%)
  • Total trades: 326 (1 pending)
  • Win rate: 54.9% (179W / 147L)
  • Predictions scored: 326 (multi-session)
  • Directional accuracy: 54.9% (▲ vs 50% baseline)
  • Avg Brier score: 0.2490 (↓ lower is better)
Portfolio performance
2026-04-24 → 2026-04-28 · 326 trades
[Chart: balance after each trade; y-axis from $100,000 to $120,000, with the $100,000 starting balance marked; legend: win, loss, balance]
Confidence calibration
  • 50–55% (n=308): 55.2% actual accuracy · 52.9% average confidence
  • 55–60% (n=17): 52.9% actual accuracy · 55.0% average confidence
  • 60–65% (n=1): 61.3% average confidence (no wins)
  • 65%+: no calls in this bucket
UP vs DOWN calls
  • UP calls: 51.8% (115 ✓ / 107 ✗)
  • DOWN calls: 61.5% (64 ✓ / 40 ✗)

DOWN calls outperform UP — and they're rarer, suggesting the model picks its bearish moments selectively.

Per-model accuracy
  • xAI (Grok): 55.9% · 322 votes · Brier 0.2461
  • Claude (Judge): 54.7% · 322 votes · Brier 0.2488
  • OpenAI (GPT-4o-mini): 47.5% · 322 votes · Brier 0.2551

Momentum (Grok) beats the contrarian (GPT-4o-mini); judge sits in the middle. Vertical line = 50% baseline.

Recent trades
last 10 of 326
Time (UTC) | Direction | Odds | Bet | P&L | Balance after
2026-04-28 15:46 | UP | 50.5% | $500 | -$500.00 | $114,719.68
2026-04-28 14:25 | UP | 51.5% | $500 | -$500.00 | $115,219.68
2026-04-28 14:20 | UP | 50.5% | $500 | -$500.00 | $115,719.68
2026-04-28 14:15 | UP | 50.5% | $500 | -$500.00 | $116,219.68
2026-04-28 14:10 | UP | 50.5% | $500 | +$490.10 | $116,719.68
2026-04-28 14:05 | UP | 50.5% | $500 | -$500.00 | $116,229.58
2026-04-28 14:00 | DOWN | 49.5% | $500 | -$500.00 | $116,729.58
2026-04-28 13:55 | UP | 50.5% | $500 | -$500.00 | $117,229.58
2026-04-28 13:50 | UP | 49.5% | $500 | +$510.10 | $117,729.58
2026-04-28 13:45 | DOWN | 49.5% | $500 | +$510.10 | $117,219.48

This is a static snapshot of the live Streamlit dashboard (dashboard.py), rendered from polysignal_btc.db as of 2026-04-28 15:46 UTC. The live version auto-refreshes every 30 seconds during a trading session.
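The confidence calibration panel groups calls by stated confidence and compares each bucket's hit rate with its average confidence. A minimal sketch of that grouping follows. The function name, the `(confidence, won)` record shape, and the bucket edges are assumptions for illustration; nothing here is taken from dashboard.py.

```python
from collections import defaultdict


def calibration_buckets(records, edges=(0.50, 0.55, 0.60, 0.65)):
    """Group (confidence, won) pairs into confidence buckets.

    records: iterable of (confidence, won) with confidence >= edges[0].
             A confidence below the first edge raises ValueError.
    Returns {label: (n, actual_accuracy, mean_confidence)}; empty buckets
    map to (0, None, None).
    """
    grouped = defaultdict(list)
    for confidence, won in records:
        # Assign each record to the highest edge that does not exceed it.
        lower = max(e for e in edges if e <= confidence)
        grouped[lower].append((confidence, won))

    summary = {}
    for i, lower in enumerate(edges):
        if i + 1 < len(edges):
            label = f"{lower:.0%}-{edges[i + 1]:.0%}"
        else:
            label = f"{lower:.0%}+"
        rows = grouped.get(lower, [])
        if not rows:
            summary[label] = (0, None, None)
            continue
        n = len(rows)
        accuracy = sum(1 for _, won in rows if won) / n
        mean_conf = sum(c for c, _ in rows) / n
        summary[label] = (n, accuracy, mean_conf)
    return summary


if __name__ == "__main__":
    # Made-up records, only to show the output shape.
    demo = [(0.52, True), (0.53, False), (0.57, True), (0.61, False)]
    for label, stats in calibration_buckets(demo).items():
        print(label, stats)
```

A bucket is well calibrated when its actual accuracy sits close to its mean confidence. With only 17 and 1 calls in the upper buckets of the snapshot, those comparisons carry very little weight.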

how it works · interpretation

Why 5-minute Bitcoin markets?

Polymarket's BTC up/down 5-minute markets resolve via a Chainlink oracle — a clean, objective binary outcome with no ambiguity. That gives me a research environment with 288 markets per day, dense feedback, and a strong baseline already encoded in the crowd's odds.

"A large group of diverse individuals will come up with better and more robust forecasts and make more intelligent decisions than even the experts." — James Surowiecki, The Wisdom of Crowds (p.41)

If markets are fully efficient, no AI should beat the crowd. The question I wanted to answer: can a multi-agent LLM debate system extract directional signal beyond what the crowd already knows?

System architecture

An orchestrator triggers four steps every five minutes, synchronized to the Polymarket market clock. Predictions and evaluations are persisted to SQLite — which is exactly what the dashboard above reads from.

Orchestrator (run.py): triggered every 5 min · synced to Polymarket clock
  • Evaluator: step 1
  • Collector: step 2 · live APIs
  • Debate Agent: step 3 · MAD
  • SQLite: predictions · eval_records · trades
  • Streamlit dashboard · metrics reporter
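One way to keep a loop in step with a 5-minute market clock is to sleep until the next boundary and then run the steps in order. The sketch below assumes the market windows start on multiples of 300 seconds since the Unix epoch, which the write-up does not state. The function names and the `steps` list are hypothetical, not the contents of run.py.

```python
import time

WINDOW_SECONDS = 300  # one 5-minute market window


def next_window_start(now_epoch: float, window: int = WINDOW_SECONDS) -> int:
    """Epoch second of the first window boundary strictly after now_epoch."""
    return (int(now_epoch) // window + 1) * window


def seconds_until_next_window(now_epoch: float, window: int = WINDOW_SECONDS) -> float:
    """How long to sleep so the next cycle starts on a boundary."""
    return next_window_start(now_epoch, window) - now_epoch


def run_cycles(steps, sleep=time.sleep, clock=time.time, max_cycles=None):
    """Wait for each boundary, then call every step in order.

    steps: list of zero-argument callables (for example evaluate, collect, debate).
    sleep/clock are injectable so the loop can be tested without waiting.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        sleep(seconds_until_next_window(clock()))
        for step in steps:
            step()
        cycles += 1


if __name__ == "__main__":
    # Dry run with a fake clock and a no-op sleep.
    log = []
    run_cycles(
        [lambda: log.append("evaluate"), lambda: log.append("collect")],
        sleep=lambda s: None,
        clock=lambda: 0.0,
        max_cycles=1,
    )
    print(log)
```

Recomputing the sleep from the wall clock on every cycle means that a slow debate round does not push later cycles off the boundary.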

Data sources

Source | Signal | Latency
Kraken WebSocket | live BTC price | ~100ms
Kraken REST OHLCV | RSI(14), volatility, 1/5/15-min momentum | ~1s
Polymarket Gamma API | crowd UP/DOWN odds | ~2s
Alternative.me | Fear & Greed Index (0–100) | ~1s
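The OHLCV-derived signals can be computed from a plain list of closing prices. The sketch below uses Wilder's smoothing for RSI(14) and a simple fractional change for momentum. Both are common definitions but are assumptions here, since the write-up does not say which variants the collector uses or what bar size feeds them.

```python
def rsi(closes, period=14):
    """Wilder-smoothed RSI for a list of closes, oldest first."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(d, 0.0) for d in deltas]
    losses = [max(-d, 0.0) for d in deltas]

    # Seed with a simple average over the first `period` changes.
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period

    # Wilder's smoothing over the remaining changes.
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period

    if avg_loss == 0:
        # No down moves in the window; by convention report 100.
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


def momentum(closes, lookback):
    """Fractional price change over the last `lookback` bars."""
    return closes[-1] / closes[-1 - lookback] - 1.0


if __name__ == "__main__":
    # Made-up closes that alternate up and down by one unit.
    closes = [100.0 + (i % 2) for i in range(15)]
    print(rsi(closes), momentum(closes, 1))
```

With 1-minute bars, `momentum(closes, 1)`, `momentum(closes, 5)`, and `momentum(closes, 15)` would correspond to the 1/5/15-minute momentum listed in the table.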

Multi-Agent Debate

Each prediction runs through three rounds in roughly 8–14 seconds. Round 1 produces independent positions for diversity. Round 2 forces conflict via cross-examination. Round 3 aggregates everything into a single verdict.

round 1 · diversity
Independent
  • Grok → momentum position
  • GPT-4o-mini → contrarian position
round 2 · conflict
Cross-examination
  • Grok reads GPT → rebuttal
  • GPT reads Grok → rebuttal
round 3 · aggregation
Judgment
  • Claude Sonnet reads all 4 outputs (both positions, both rebuttals)
  • direction + confidence + reasoning
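The three rounds can be sketched as a small function that takes the agents as plain callables. This is only an illustration of the control flow. The `Agent` type, the `Verdict` fields, and the prompt wording are hypothetical stand-ins, and no real LLM client is called.

```python
from dataclasses import dataclass
from typing import Callable, List

# A stand-in for an LLM call: prompt text in, response text out.
Agent = Callable[[str], str]


@dataclass
class Verdict:
    direction: str      # "UP" or "DOWN"
    confidence: float   # judge's stated confidence in that direction
    reasoning: str


def run_debate(context: str, momentum: Agent, contrarian: Agent,
               judge: Callable[[List[str]], Verdict]) -> Verdict:
    # Round 1 (diversity): each agent answers without seeing the other.
    momentum_position = momentum(context)
    contrarian_position = contrarian(context)

    # Round 2 (conflict): each agent reads the other's position and rebuts it.
    momentum_rebuttal = momentum(
        f"{context}\nOpponent position: {contrarian_position}\nWrite a rebuttal."
    )
    contrarian_rebuttal = contrarian(
        f"{context}\nOpponent position: {momentum_position}\nWrite a rebuttal."
    )

    # Round 3 (aggregation): the judge reads all four texts and decides.
    return judge([
        momentum_position,
        contrarian_position,
        momentum_rebuttal,
        contrarian_rebuttal,
    ])


if __name__ == "__main__":
    # Stub agents so the flow can be run without any API keys.
    verdict = run_debate(
        "BTC market context goes here",
        momentum=lambda prompt: "Trend continues.",
        contrarian=lambda prompt: "Move is overextended.",
        judge=lambda texts: Verdict("UP", 0.53, f"read {len(texts)} texts"),
    )
    print(verdict)
```

Passing the agents in as callables keeps the debate logic testable with stubs, and the two round-1 calls are independent, so they could run concurrently.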

Reading the dashboard

Two metrics matter for this kind of system. Both appear in the KPI strip up top.
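The KPI strip reports directional accuracy and an average Brier score. A minimal sketch of both follows, assuming the Brier score is taken between the forecast probability of UP and the 0/1 outcome. The write-up does not spell out the exact convention, and the function names are illustrative.

```python
def directional_accuracy(calls, outcomes):
    """Share of calls ("UP"/"DOWN") that match the resolved outcome."""
    return sum(call == outcome for call, outcome in zip(calls, outcomes)) / len(outcomes)


def brier_score(p_up, went_up):
    """Mean squared error between forecast P(UP) and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(p_up, went_up)) / len(p_up)


if __name__ == "__main__":
    # Made-up forecasts, only to show the call pattern.
    print(directional_accuracy(["UP", "DOWN", "UP"], ["UP", "UP", "UP"]))
    print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

A forecaster that always says 50% scores exactly 0.25 on this metric, so 0.25 is the natural reference point when reading a Brier score on a binary market.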

The portfolio curve shows the actual P&L trajectory at a flat $500/trade sizing. Each green dot is a winning trade; each gold dot a loser. The drift is upward but lumpy, a reminder that 54.9% accuracy still means losing about 45% of trades.
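The winning rows in the Recent trades table are consistent with a simple payout rule: a $500 stake bought at price p pays out 500 / p if the call is right and nothing otherwise. The sketch below is inferred from those rows, not taken from the project code, and it assumes no fees or slippage.

```python
def trade_pnl(bet: float, odds: float, won: bool) -> float:
    """P&L of a binary-market bet bought at price `odds` (0 < odds < 1)."""
    if not won:
        return -bet
    shares = bet / odds          # each share pays $1 if the call is right
    return round(shares - bet, 2)


if __name__ == "__main__":
    print(trade_pnl(500, 0.505, True))   # matches the +$490.10 rows
    print(trade_pnl(500, 0.495, True))   # matches the +$510.10 rows
    print(trade_pnl(500, 0.505, False))  # a loss forfeits the full stake
```

Under this rule the expected P&L per dollar staked is w / p - 1 for win rate w, so a trade breaks even when the win rate equals the price paid. That is why odds near 50% make the 50% baseline the bar to clear.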

What the numbers say

Three things stand out from the dashboard:

Limitations & future work

I want to be honest about what 326 trades over a calm ~5-day window can and can't tell us:
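One rough way to put a number on sample size is to ask how often a coin-flip forecaster would reach the same win count, and how wide the plausible range for the true win rate is. The sketch below uses an exact binomial tail and a Wilson score interval. It treats trades as independent, which consecutive 5-minute windows may not be, and the function names are illustrative.

```python
from math import comb, sqrt


def binom_tail(wins: int, n: int, p: float = 0.5) -> float:
    """Exact one-sided P(X >= wins) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(wins, n + 1))


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    phat = wins / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    wins, n = 179, 326  # from the dashboard snapshot
    print("P(at least this many wins by chance):", binom_tail(wins, n))
    print("Wilson interval for the win rate:", wilson_interval(wins, n))
```

If the interval's lower bound sits near or below 50%, the sample alone cannot rule out a no-edge forecaster, however encouraging the point estimate looks.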

What's next

The takeaway

Multi-agent debate produces a modest but real edge over the 50% baseline — 54.9% directional accuracy and a +14.72% simulated return over multiple sessions. Structured disagreement, then aggregation, extracts incremental predictive signal — and crucially, the system stays well-calibrated rather than overconfident. Preliminary, but the direction is encouraging.

The deck

The full presentation walks through motivation, architecture, the MAD pipeline, and results.

Polysignal-BTC5.pdf · 15 slides