Polysignal-BTC5:
Can MAD (Multi-Agent Debate) beat the crowd?
Three AI agents (a momentum analyst, a contrarian, and a judge) debate Bitcoin's next 5-minute price move on Polymarket. The dashboard below is a snapshot of the system's actual performance across multiple sessions, 326 trades in total.
DOWN calls outperform UP — and they're rarer, suggesting the model picks its bearish moments selectively.
Momentum (Grok) beats the contrarian (GPT-4o-mini); judge sits in the middle. Vertical line = 50% baseline.
| Time (UTC) | Direction | Odds | Bet | P&L | Balance after |
|---|---|---|---|---|---|
| 2026-04-28 15:46 | UP | 50.5% | $500 | -$500.00 | $114,719.68 |
| 2026-04-28 14:25 | UP | 51.5% | $500 | -$500.00 | $115,219.68 |
| 2026-04-28 14:20 | UP | 50.5% | $500 | -$500.00 | $115,719.68 |
| 2026-04-28 14:15 | UP | 50.5% | $500 | -$500.00 | $116,219.68 |
| 2026-04-28 14:10 | UP | 50.5% | $500 | +$490.10 | $116,719.68 |
| 2026-04-28 14:05 | UP | 50.5% | $500 | -$500.00 | $116,229.58 |
| 2026-04-28 14:00 | DOWN | 49.5% | $500 | -$500.00 | $116,729.58 |
| 2026-04-28 13:55 | UP | 50.5% | $500 | -$500.00 | $117,229.58 |
| 2026-04-28 13:50 | UP | 49.5% | $500 | +$510.10 | $117,729.58 |
| 2026-04-28 13:45 | DOWN | 49.5% | $500 | +$510.10 | $117,219.48 |
This is a static snapshot of the live Streamlit dashboard (dashboard.py),
rendered from polysignal_btc.db as of 2026-04-28 15:46 UTC.
The live version auto-refreshes every 30 seconds during a trading session.
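The P&L column in the trade table above is consistent with a simple binary-share payoff: a $500 stake buys `stake / price` shares, each paying $1 if the call is right. The sketch below reproduces that arithmetic. The function name is hypothetical, and ignoring fees and slippage is an assumption rather than something the dashboard states.

```python
def binary_pnl(stake: float, price: float, won: bool) -> float:
    """P&L for putting `stake` dollars into a binary share bought at `price`.

    A winning share pays $1, so `stake / price` shares pay out `stake / price`.
    Fees and slippage are ignored (assumption).
    """
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    return stake / price - stake if won else -stake


# Winning UP call bought at 50.5%, winning call bought at 49.5%, and a loss.
print(round(binary_pnl(500, 0.505, True), 2))
print(round(binary_pnl(500, 0.495, True), 2))
print(binary_pnl(500, 0.505, False))
```

The first two values match the +$490.10 and +$510.10 rows in the table; every losing row forfeits the full $500 stake.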
Why 5-minute Bitcoin markets?
Polymarket's BTC up/down 5-minute markets resolve via a Chainlink oracle — a clean, objective binary outcome with no ambiguity. That gives me a research environment with 288 markets per day, dense feedback, and a strong baseline already encoded in the crowd's odds.
"A large group of diverse individuals will come up with better and more robust forecasts and make more intelligent decisions than even the experts." — James Surowiecki, The Wisdom of Crowds (p.41)
If markets are fully efficient, no AI should beat the crowd. The question I wanted to answer: can a multi-agent LLM debate system extract directional signal beyond what the crowd already knows?
System architecture
An orchestrator triggers four steps every five minutes, synchronized to the Polymarket market clock. Predictions and evaluations are persisted to SQLite — which is exactly what the dashboard above reads from.
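One way to keep a loop in step with a 5-minute market clock is to sleep until the next multiple of 300 seconds in Unix time. This is a minimal sketch, not the project's orchestrator: it assumes market windows open on UTC 5-minute boundaries, and `run_cycles` and its parameters are hypothetical names.

```python
import time

WINDOW_SECONDS = 300  # one 5-minute market window


def seconds_until_next_window(now: float, window: int = WINDOW_SECONDS) -> float:
    """Seconds from Unix time `now` to the next multiple of `window`."""
    return window - (now % window)


def run_cycles(cycle, n_cycles, sleep=time.sleep, clock=time.time):
    """Call `cycle()` at the start of each of the next `n_cycles` windows.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    for _ in range(n_cycles):
        sleep(seconds_until_next_window(clock()))
        cycle()


print(seconds_until_next_window(1_000_000_123.0))
```

Re-reading the clock before every sleep means a slow cycle does not accumulate drift; the next wake-up is always computed from the current time.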
Data sources
| Source | Signal | Latency |
|---|---|---|
| Kraken WebSocket | live BTC price | ~100ms |
| Kraken REST OHLCV | RSI(14), volatility, 1/5/15-min momentum | ~1s |
| Polymarket Gamma API | crowd UP/DOWN odds | ~2s |
| Alternative.me | Fear & Greed Index (0–100) | ~1s |
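For readers unfamiliar with RSI(14), here is a minimal sketch of the Wilder-smoothed variant computed from a list of closing prices. Which RSI variant the project uses is not stated, so the smoothing choice and the handling of a perfectly flat series are assumptions.

```python
def rsi(closes, period=14):
    """Wilder-smoothed RSI over closing prices, oldest first."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    changes = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(c, 0.0) for c in changes]
    losses = [max(-c, 0.0) for c in changes]

    # Seed with simple averages over the first `period` changes.
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period

    # Wilder smoothing over any remaining changes.
    for g, l in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + g) / period
        avg_loss = (avg_loss * (period - 1) + l) / period

    if avg_gain == 0 and avg_loss == 0:
        return 50.0  # flat series: treated as neutral (assumption)
    if avg_loss == 0:
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


print(rsi([float(i) for i in range(20)]))
```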
Multi-Agent Debate
Each prediction runs through three rounds in roughly 8–14 seconds. Round 1 produces independent positions for diversity. Round 2 forces conflict via cross-examination. Round 3 aggregates everything into a single verdict.
- Round 1: Grok → momentum position; GPT-4o-mini → contrarian position
- Round 2: Grok reads GPT → rebuttal; GPT reads Grok → rebuttal
- Round 3: Claude Sonnet reads all 4 → direction + confidence + reasoning
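The three rounds described above can be sketched as a plain function over three callables. This is an illustration of the control flow only: the `Agent` type, the prompt wording, and the stub agents are all hypothetical, and no real model API is called.

```python
from dataclasses import dataclass
from typing import Callable

Agent = Callable[[str], str]  # hypothetical interface: prompt in, text out


@dataclass
class Debate:
    momentum_open: str
    contrarian_open: str
    momentum_rebuttal: str
    contrarian_rebuttal: str
    verdict: str


def run_debate(snapshot: str, momentum: Agent, contrarian: Agent, judge: Agent) -> Debate:
    # Round 1: independent positions; neither agent sees the other's output.
    m1 = momentum(f"Argue the momentum case.\n{snapshot}")
    c1 = contrarian(f"Argue the contrarian case.\n{snapshot}")
    # Round 2: cross-examination; each agent rebuts the other's opening.
    m2 = momentum(f"Rebut this argument:\n{c1}")
    c2 = contrarian(f"Rebut this argument:\n{m1}")
    # Round 3: the judge reads all four messages and issues one verdict.
    transcript = "\n\n".join([m1, c1, m2, c2])
    verdict = judge(f"Give direction, confidence, and reasoning.\n{transcript}")
    return Debate(m1, c1, m2, c2, verdict)


# Stub agents standing in for the real models.
result = run_debate(
    "BTC snapshot (placeholder)",
    momentum=lambda prompt: "M:" + prompt[:5],
    contrarian=lambda prompt: "C:" + prompt[:5],
    judge=lambda prompt: "UP 0.53",
)
print(result.verdict)
```

Keeping the agents as injected callables makes it easy to swap a model in one role, which is the kind of change the "stronger contrarian" idea would need.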
Reading the dashboard
Two metrics matter for this kind of system. Both appear in the KPI strip up top.
- Directional accuracy: did the model predict the correct direction? It treats a 51% call and a 99% call the same, so it answers "would this signal make money?" At 54.9% over 326 trades, the system sits 4.9 points above the 50% coin-flip baseline.
- Brier score: the mean squared error between the predicted probability and the 0/1 outcome, which heavily penalizes overconfident misses. A constant 50% forecast scores 0.25; the system scores 0.2490, just under that baseline, which lines up with the calibration panel showing the model rarely strays past 55% confidence.
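Both metrics are a few lines of arithmetic. The sketch below assumes each prediction is stored as a probability of UP and each outcome as 1 (up) or 0 (down); the function names and the toy numbers are illustrative, not taken from the project's database.

```python
def directional_accuracy(p_up, went_up):
    """Share of calls whose side (UP if p >= 0.5, else DOWN) matched the outcome."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(p_up, went_up))
    return hits / len(went_up)


def brier_score(p_up, went_up):
    """Mean squared error between predicted P(up) and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(p_up, went_up)) / len(went_up)


# Made-up toy data: four predictions and what actually happened.
probs = [0.53, 0.48, 0.55, 0.51]
outcomes = [1, 0, 0, 1]
print(directional_accuracy(probs, outcomes))
print(brier_score(probs, outcomes))
print(brier_score([0.5, 0.5], [1, 0]))  # a constant 50% forecast
```

Because the error is squared, a confident miss costs far more than a timid one: a 99% call that fails contributes about 0.98 to the average, against 0.25 for a 50% call.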
The portfolio curve shows the actual P&L trajectory at a flat $500/trade sizing. Each green dot is a winning trade; each gold dot is a losing one. The drift is upward but lumpy, a reminder that 54.9% accuracy still means losing roughly 45% of trades.
What the numbers say
Three things stand out from the dashboard:
- Momentum dominates short-term signal. Grok at 55.9% is the strongest single agent. The contrarian (GPT-4o-mini) sits at 47.5%, slightly worse than a coin flip, which suggests fading momentum at 5-minute horizons is a losing strategy.
- The judge adds value over the contrarian, not Grok. Claude's 54.7% sits between the two — sometimes deferring to Grok, sometimes blending. When the judge sided with Grok, accuracy jumped above 60%; when it sided with GPT, it cratered.
- Calibration is good but conservative. 308 of 326 calls land in the 50–55% confidence bucket, where actual accuracy was 55.2% — almost perfectly calibrated. The model rarely commits past 55% because the underlying signal genuinely is weak.
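A calibration check like the one in the last point groups calls by stated confidence and compares each bucket's hit rate with its confidence range. This is a minimal sketch: the 5-point bucket width mirrors the 50–55% bucket mentioned above, while the function name and sample data are made up.

```python
from collections import defaultdict


def calibration_table(confidences, correct, width_pct=5):
    """Map each bucket's lower edge to (number of calls, hit rate).

    Confidences are probabilities in [0.5, 1.0] for the side that was called.
    """
    step = width_pct * 10  # bucket width in tenths of a percent
    counts = defaultdict(lambda: [0, 0])  # lower edge -> [calls, hits]
    for c, ok in zip(confidences, correct):
        tenths = round(c * 1000)  # integer arithmetic avoids float edge cases
        low = 500 + ((tenths - 500) // step) * step
        low = min(low, 1000 - step)  # a 100% call goes in the top bucket
        counts[low][0] += 1
        counts[low][1] += int(ok)
    return {low / 1000: (n, hits / n) for low, (n, hits) in sorted(counts.items())}


# Made-up sample: confidence of each call and whether it was right.
table = calibration_table([0.51, 0.53, 0.549, 0.55, 0.62], [1, 0, 1, 1, 0])
for low, (n, hit_rate) in table.items():
    print(f"{low:.2f}+  n={n}  hit rate={hit_rate:.2f}")
```

A bucket is well calibrated when its hit rate falls inside (or close to) its own confidence range.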
Limitations & future work
I want to be honest about what 326 trades over a calm ~5-day window can and can't tell us:
- Statistical power — confidence intervals are still wider than I'd like; a 2,000+ trade run is the goal.
- No live capital: simulated P&L only; trading on the actual market is restricted for US users.
- Weak contrarian — GPT-4o-mini at 47.5% may be too small for the role.
- Single regime — no evaluation across high/low volatility regimes yet.
- Lite models only — reasoning-tier models (Opus, GPT-5) would likely shift the picture.
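The statistical-power point can be made concrete with a Wilson score interval on the hit rate. The sketch below assumes 179 wins out of 326, a count inferred from the reported 54.9% rather than read from the database, and uses the usual 95% z-value.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# 179 wins is an inferred count: 0.549 * 326 is about 179.
low, high = wilson_interval(179, 326)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 49.5% to 60.2%, so it still brushes the 50% baseline. That is why a longer run is the first item on the to-do list.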
What's next
- Longer evaluation — extending toward 7+ days, 2,000+ trades.
- Stronger contrarian — swap in Claude Opus or GPT-5 for the contrarian role.
- Adaptive sizing — Kelly-style sizing instead of flat $500/trade.
- Multi-horizon — test 15-min and 30-min markets.
- Live deployment — real capital to validate the simulated results.
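For the adaptive-sizing item, the Kelly fraction for a binary share bought at price q with estimated win probability p works out to (p - q) / (1 - q). The sketch below is one possible starting point, not the planned implementation; the function name is hypothetical and fees are ignored.

```python
def kelly_fraction(p_win: float, price: float) -> float:
    """Fraction of bankroll to stake on a binary share that pays $1 on a win.

    With net odds b = (1 - price) / price, the Kelly formula
    f = p - (1 - p) / b simplifies to (p - price) / (1 - price).
    Returns 0 when the estimated edge is not positive.
    """
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability")
    return max((p_win - price) / (1.0 - price), 0.0)


# Example: model says 55% while the market prices the share at 50.5%.
print(round(kelly_fraction(0.55, 0.505), 4))
```

Full Kelly is very sensitive to errors in the probability estimate, so with an edge this thin a fractional version (for example half Kelly) is a common, more cautious choice.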
The takeaway
Multi-agent debate produces a modest but real edge over the 50% baseline — 54.9% directional accuracy and a +14.72% simulated return over multiple sessions. Structured disagreement, then aggregation, extracts incremental predictive signal — and crucially, the system stays well-calibrated rather than overconfident. Preliminary, but the direction is encouraging.
The deck
The full presentation walks through motivation, architecture, the MAD pipeline, and results.