CyberGov: What I learned running three AI agents as blockchain governance delegates

Quick primer

You don't need to be a blockchain enthusiast to follow this post. Here's the minimum context:

The setup: Polkadot is a blockchain network with a communal treasury currently worth a few million dollars. Anyone can submit a proposal requesting funds ("give us $50k to build a developer tool"), and token holders vote on whether to approve it. It's like a decentralized grants program with no central committee.

The problem: There are a lot of proposals. Reviewing them all is a full-time job. Most token holders don't have time, so they delegate their voting power to "delegates": people (or, in this case, AI systems) who vote on their behalf.

Subsquare (https://polkadot.subsquare.io/): Think of it as the governance dashboard. It's where proposals are listed, discussed, and voted on. When my AI agents voted, their reasoning appeared as comments on Subsquare for everyone to see.

Treasury governance is a perfect test-bed for AI decision-making because:

  • Real money is at stake (not a toy problem)
  • Proposals are adversarial (people try to game the system)
  • Everything is public and recorded permanently
  • The community expects accountability

If you've ever thought "we should use AI to help make complex decisions, but how do we make sure it's not a black box?", that's exactly what this experiment tried to answer.

Now, back to the story.

The delegation has now ended, and it's time to reflect on the CyberGov V0 experiment. If you're interested in reading about how this started, check out these links:

For three weeks in September 2025, three AI agents named Balthazar, Melchior, and Caspar voted on blockchain treasury proposals on my behalf. They analyzed 19 proposals on Polkadot, 3 on Kusama, and a handful on the Paseo test network. Here's what happened and what it might mean for anyone building AI systems that need to be trusted.

Why this matters

Before diving in: you don't need to care about Polkadot to find this interesting. The core problem is universal: how do you build AI systems that make consequential decisions while remaining auditable, transparent, and resistant to manipulation?

Blockchain governance was my test case because:

  • Decisions involve real money (treasury funds)
  • There's an adversarial environment (people will try to game the system)
  • Everything happens on-chain, creating a permanent record
  • The community expects transparency from delegates

But the lessons apply anywhere you're deploying AI for high-stakes decisions: content moderation, loan approvals, hiring recommendations, medical triage. The question is always the same: can we trust this thing, and can we verify that trust?

The experiment

CyberGov V0 was an experiment to see if LLMs could provide transparent, reproducible governance decisions. Instead of me personally reviewing dozens of treasury proposals, I built a system where three AI agents would independently analyze each proposal and collectively decide how to vote.

The names come from the MAGI supercomputers in Neon Genesis Evangelion: three systems that must reach consensus to make critical decisions. Each MAGI had a distinct personality based on aspects of their creator. I did the same thing.

The three agents

  • Balthazar (GPT-5) was the strategist. His job was evaluating whether proposals strengthened Polkadot's competitive position against other blockchains. Does this create sustainable advantage, or just temporary hype?
  • Melchior (Gemini 2.5 Pro) focused on ecosystem growth and ROI. His core question: does this activity actually translate into measurable value, or are we just subsidizing user acquisition that evaporates when the money runs out?
  • Caspar (Claude Sonnet 4) was the risk analyst, treating every treasury allocation as an investment rather than a grant. He flagged moral hazard, questioned multi-year commitments, and demanded accountability mechanisms.

Each agent received the exact same proposal text but evaluated it through their distinct lens.

The numbers

Over three weeks, the system voted on 19 treasury proposals:

Decision   Count
Abstain    10 (53%)
Aye        8 (42%)
Nay        1 (5%)

Only 4 votes were unanimous (21%). The rest involved disagreement between agents.

[Figure: visuals for proposal 1750]

Distinct personalities

The per-agent breakdown reveals how the personas actually influenced behavior:

Agent       Aye   Nay   Abstain   Personality
Melchior    15    2     2         Growth-focused, most bullish
Balthazar   9     2     8         Strategic, middle ground
Caspar      3     6     10        Risk-focused, most conservative

Melchior wanted to fund almost everything. Caspar wanted to fund almost nothing. Balthazar was the swing vote. This directly reflected how I'd written their system prompts.

One interesting case: Proposal 1703 had Balthazar voting Nay while Caspar and Melchior voted Aye. The strategist saw competitive risk; the risk analyst (surprisingly) saw an acceptable investment. The growth analyst saw opportunity. Final result: Aye. The system worked as designed: genuine disagreement led to a deliberated outcome. After a lengthy discussion with people reaching out and commenting, the truth table for the MAGIs was updated to reflect the following:

LLM Agent 1   LLM Agent 2   LLM Agent 3   Vote Outcome
AYE           AYE           AYE           AYE
AYE           AYE           ABSTAIN       AYE
NAY           NAY           NAY           NAY
NAY           NAY           ABSTAIN       NAY
AYE           AYE           NAY           ABSTAIN
AYE           NAY           ABSTAIN       ABSTAIN
AYE           NAY           NAY           ABSTAIN
AYE           ABSTAIN       ABSTAIN       ABSTAIN
NAY           ABSTAIN       ABSTAIN       ABSTAIN
ABSTAIN       ABSTAIN       ABSTAIN       ABSTAIN

This means that retroactively, the decision should have been ABSTAIN.

The voting logic

The truth table was then deliberately made more conservative:

  • Unanimous agreement → Cast that vote
  • Two agree, one abstains → Cast the majority vote
  • Any genuine disagreement → Abstain

If Balthazar saw strategic value but Caspar flagged unacceptable risk, the system abstained. The philosophy was: when in doubt, don't spend someone else's money.
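To make the rule concrete, here is a minimal sketch of that truth table as code. The function name and the vote representation are mine, not the production pipeline:

```python
from collections import Counter

def combine_votes(votes: list[str]) -> str:
    """Combine three agent votes ("AYE", "NAY", "ABSTAIN") into one on-chain vote."""
    counts = Counter(votes)
    # Any AYE/NAY split means genuine disagreement: don't spend the money.
    if counts["AYE"] > 0 and counts["NAY"] > 0:
        return "ABSTAIN"
    # Unanimity, or two agree while the third abstains: cast the majority vote.
    for decision in ("AYE", "NAY"):
        if counts[decision] >= 2:
            return decision
    # A single lone opinion, or three abstentions: abstain.
    return "ABSTAIN"

assert combine_votes(["AYE", "AYE", "ABSTAIN"]) == "AYE"
assert combine_votes(["AYE", "AYE", "NAY"]) == "ABSTAIN"
assert combine_votes(["NAY", "ABSTAIN", "ABSTAIN"]) == "ABSTAIN"
```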

The 53% abstention rate wasn't a bug; it was the system being appropriately uncertain. Most proposals had something to like and something to worry about.

What worked

Radical transparency

Every vote came with a manifest file containing SHA256 hashes of all inputs and outputs (example), links to the GitHub Actions run where inference happened, and the exact proposal text the agents saw. The hash was submitted on-chain alongside each vote.

Anyone could:

  1. Download the manifest
  2. See exactly what text the agents received
  3. Verify the hash matched what was recorded on-chain
  4. Re-run the pipeline to check reproducibility
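As an illustration of steps 3 and 4, a verifier could recompute the hashes locally and compare them against the attestation. This is only a sketch: the manifest field names (`artifacts`, `path`, `sha256`) are placeholders, not the real layout.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical manifest layout: each input/output artifact listed with its expected hash.
with open("manifest.json", "rb") as f:
    raw_manifest = f.read()
manifest = json.loads(raw_manifest)

for artifact in manifest["artifacts"]:  # e.g. proposal text, per-agent outputs
    with open(artifact["path"], "rb") as f:
        recomputed = sha256_hex(f.read())
    assert recomputed == artifact["sha256"], f"mismatch: {artifact['path']}"

# The hash of the manifest itself is what was submitted on-chain alongside the vote.
print("compare to the on-chain attestation:", sha256_hex(raw_manifest))
```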

This is table stakes for trustworthy AI. If you can't show your work, you shouldn't expect trust.

Consistent analysis

The agents never had a bad day. They evaluated proposal #1757 with the same rigor as proposal #1701. They caught prompt injection attempts (during testing). They flagged missing budget breakdowns and vague milestones consistently.

Each agent produced structured output:

  • Neutral critical analysis with scores (Feasibility/10, Value-for-Money/10, Risk/10)
  • Key factors considered
  • Decision trace showing reasoning
  • Safety flags for detected issues
  • Persona-filtered rationale
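For illustration, that per-agent record could be modeled roughly like this; the field names approximate the list above and are not the exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentAssessment:
    agent: str                  # "balthazar" | "melchior" | "caspar"
    vote: str                   # "AYE" | "NAY" | "ABSTAIN"
    feasibility: int            # 0-10
    value_for_money: int        # 0-10
    risk: int                   # 0-10
    key_factors: list[str] = field(default_factory=list)
    decision_trace: list[str] = field(default_factory=list)  # step-by-step reasoning
    safety_flags: list[str] = field(default_factory=list)    # e.g. suspected prompt injection
    rationale: str = ""         # persona-filtered explanation published as a comment
```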

Testnets are your friend

Before touching real governance, I ran the system on Paseo (Polkadot's testnet). Subsquare (the governance interface) worked identically on testnet, so I could see exactly how comments would render, test the proxy account setup, and verify the whole pipeline without risking actual treasury funds.

This sounds obvious, but many AI deployments skip this step. If your system can fail safely in a sandbox first, use the sandbox.

What didn't work

GitHub Actions logs expire

Here's something I didn't anticipate: GitHub Actions logs have a retention limit. After 90 days, those "verifiable execution logs" I proudly linked to? Gone.

For a governance system where accountability might matter years later, this is a real problem. The on-chain hash remains, and the manifest files in S3 persist, but the process evidence disappears. A future version needs to archive execution logs to permanent storage like the one being built for Polkadot right now.

Lesson: Audit trails need to outlive your cloud provider's default retention policies.

Context limitations

The agents had no memory of past proposals or community relationships. They couldn't know that this was someone's third failed delivery, or that this team had consistently exceeded expectations before.

Proposal 1745 got an Abstain partly because the agents couldn't see the proposer's track record. A human delegate would have known the context.

Lesson: We need an embeddings database or a historical archive of all proposals' contents.

Proposals often linked to external documents like Notion pages, Google Drive, etc. The system deliberately excluded these (too much complexity, URL rot risk), which meant agents sometimes missed crucial details.

This is a fundamental tension: you want agents to evaluate complete information, but you also want deterministic inputs. I chose determinism over completeness. Not sure that was right.

Lesson: A Web3 governance system needs, at its core, a decentralized way to submit proposal content (important read)

No deliberation

When agents disagreed, the system just... abstained. There was no "Balthazar, explain to Caspar why this risk is acceptable." No synthesis of perspectives. Just voting logic.

Real consensus involves argument, persuasion, and updating beliefs. CyberGov V0 had none of that.

Lessons learned

DSPy changed how I think about LLM applications

I used DSPy for LLM orchestration, and it was a revelation. Instead of prompt engineering through trial and error, I defined:

  • A signature (what inputs and outputs I expected)
  • A few training examples
  • A compilation step that optimized prompts automatically

The framework handled few-shot learning, structured outputs, and cross-model compatibility. When I switched from one model to another, the same pipeline worked.
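For a sense of what this looks like, here is a minimal DSPy-style sketch. The signature fields, persona wording, and model name are illustrative, not the production setup:

```python
import dspy

class EvaluateProposal(dspy.Signature):
    """Evaluate a treasury proposal through a given persona's lens."""
    proposal_text: str = dspy.InputField()
    persona: str = dspy.InputField(desc="e.g. 'risk analyst' or 'growth analyst'")
    vote: str = dspy.OutputField(desc="AYE, NAY, or ABSTAIN")
    rationale: str = dspy.OutputField()

# Swapping the model string is essentially all that changes between agents.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
evaluate = dspy.ChainOfThought(EvaluateProposal)

result = evaluate(proposal_text="Requesting 50k to build ...", persona="risk analyst")
print(result.vote, result.rationale)
```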

Lesson: Stop hand-crafting prompts. Use a framework that treats prompt optimization as a learnable problem.

Transparency is achievable

The "verifiable process > opaque conviction" principle worked. Blockchain + deterministic settings + content-addressed storage = auditable AI decisions.

[Figure: verification logic]

But it required:

  • Running inference on public infrastructure (GitHub Actions)
  • Storing all artifacts with content hashes
  • Submitting attestations on-chain
  • Building a whole transparency layer into the output

Most teams won't do this. They should.

Multi-agent consensus is fragile

Three models disagreeing didn't necessarily mean "this is a hard decision"; more often it meant the proposal was written ambiguously, or that one model had fixated on an irrelevant detail.

Having multiple agents creates useful tension, but you need better mechanisms for resolving that tension than "just abstain."

Personas help, but they're not wisdom

Giving agents distinct evaluation criteria (strategic vs. growth vs. risk) created useful diversity. The Caspar/Melchior dynamic (one conservative, one aggressive) was genuinely valuable.

But they were still pattern matchers. When Caspar flagged "moral hazard," he was matching that concept to proposal text, not reasoning from first principles about incentive structures.

The abstention default was correct

In governance, the cost of a wrong YES (wasted treasury funds) exceeds the cost of a wrong ABSTAIN (missed opportunity). The conservative bias felt right.

53% abstention might look like the system was useless. I'd argue it was appropriately humble. Here's its policy compared to other delegates:

What a V1 could look like

CyberGov was a V0 proof of concept. Here's what a production version might include:

Historical context via RAG

Build a vector database of past proposals, their outcomes, and proposer track records. Before evaluating a new proposal, retrieve relevant context: "This team delivered Project X on time and under budget" or "This proposer's last three proposals failed to deliver milestones."

I have a POC of this using ChromaDB, but it's not ready for prime time.
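A rough sketch of that retrieval step with ChromaDB; the collection name, IDs, and metadata fields are illustrative:

```python
import chromadb

client = chromadb.PersistentClient(path="./proposal_index")
proposals = client.get_or_create_collection("past_proposals")

# Index historical proposals once, with outcome metadata attached.
proposals.add(
    ids=["1650", "1662"],
    documents=["Full text of proposal 1650 ...", "Full text of proposal 1662 ..."],
    metadatas=[
        {"proposer": "team-a", "outcome": "delivered on time and under budget"},
        {"proposer": "team-b", "outcome": "missed two milestones"},
    ],
)

# At evaluation time, pull the most similar past proposals in as extra context.
new_proposal_text = "Requesting funds to build a developer tool ..."
context = proposals.query(query_texts=[new_proposal_text], n_results=5)
```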

Agent deliberation protocols

Instead of independent voting, have agents respond to each other:

  1. Each agent gives initial assessment
  2. Agents see each other's concerns
  3. Round two: agents can update their position or rebut
  4. Final vote

This mimics how actual committees work. The computational cost is a bit higher, but the decisions would be richer. IMHO this is a primitive that's needed beyond Web3 AI governance, too.
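A sketch of how that protocol could be wired up; the `assess` and `revise` methods stand in for per-agent LLM calls and are hypothetical, and `combine_votes` is the truth-table sketch from earlier:

```python
def deliberate(agents, proposal: str) -> str:
    # Round one: independent assessments, exactly as in V0.
    first_round = {a.name: a.assess(proposal) for a in agents}

    # Round two: each agent sees the others' votes and rationales and may
    # update its position or rebut them.
    peer_views = {name: (r.vote, r.rationale) for name, r in first_round.items()}
    second_round = {
        a.name: a.revise(proposal, {n: v for n, v in peer_views.items() if n != a.name})
        for a in agents
    }

    # The final decision still goes through the conservative truth table.
    return combine_votes([r.vote for r in second_round.values()])
```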

Dynamic re-evaluation

Proposals evolve during voting periods based on community feedback. A V1 could monitor for significant edits and trigger re-analysis, possibly changing its vote if new information addresses previous concerns.
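A minimal sketch of that trigger, assuming the pipeline stores the content hash it last analyzed; `fetch_proposal_text` and `run_magi_pipeline` are placeholders:

```python
import hashlib

def maybe_reanalyze(proposal_id: int, last_analyzed_hash: str):
    """Re-run the agents only if the proposal content changed since the last vote."""
    current_text = fetch_proposal_text(proposal_id)       # placeholder: Subsquare / on-chain fetch
    current_hash = hashlib.sha256(current_text.encode()).hexdigest()
    if current_hash == last_analyzed_hash:
        return None                                       # nothing new; keep the existing vote
    return run_magi_pipeline(proposal_id, current_text)   # placeholder: full re-analysis, possible re-vote
```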

Permanent audit logs

Store execution logs, full API responses, and all intermediate artifacts to permanent decentralized storage. Governance decisions might be contested years later, so the evidence needs to persist and outlive us.

Human-in-the-Loop mode

The best near-term use case might not be autonomous voting, but AI-assisted analysis that a human delegate reviews. The structured output (scores, factors, traces) would be genuinely useful as "first-pass triage" before a human makes the final call.

Closing thoughts

The MAGI system in Evangelion was ultimately fallible because it could be hacked, manipulated, or simply wrong about what humanity needed. CyberGov V0 was too. But at least you could see exactly how it was wrong.

That's the real contribution here: not that AI can govern well, but that AI governance can be transparent. The ability to audit AI decisions matters more than whether any particular decision was optimal.

CyberGov V0 abstained 53% of the time. It voted against only one proposal. It agreed with itself only 21% of the time. By most metrics, it was not conclusive.

But every decision was:

  • Publicly reasoned
  • Cryptographically attested
  • Independently verifiable

That's more than most human delegates offer.

[Figure: CyberGov V0 compared to other delegates]

Links: