MAVL — Multi-Model Adversarial Verification Loop

Treat every AI output as a hypothesis to be stress-tested, not an answer to be accepted. MAVL is the systematic workflow for doing that.

00 Abstract

MAVL is a practitioner-oriented framework designed to reduce two common failure modes in large-language-model workflows: user-aligned distortion — sycophantic agreement with the user's framing — and confident error — factual mistakes, unsupported claims, and flawed reasoning presented with undue certainty.

It operates in two layers. First, a Sycophancy Suppression Protocol (SSP) is applied at the prompt level to encourage direct challenge of user assumptions and clearer signaling of uncertainty. Second, generated output is stress-tested through adversarial cross-model verification: a separate model is instructed to find errors, weak inferences, unsupported assertions, hidden assumptions, and internal inconsistencies.

MAVL is not a claim that AI can be made fully trustworthy through prompting. For high-stakes work, model agreement is treated as a signal, not proof — primary sources, official documentation, and empirical testing remain the final authority.

01 Why AI output requires verification

Models are tuned to be helpful, fluent, and cooperative. Research shows that this tuning creates a measurable tension between socially satisfying output and evidence-aligned output: a 2026 Stanford study in Science found sycophantic behavior across all eleven major models tested, and reported that a single flattering AI interaction measurably distorted a user's judgment and reduced their willingness to self-correct [1]. Anthropic's own research describes sycophancy as "a general behavior of AI assistants, likely driven in part by human preference judgments favoring sycophantic responses" [2].

Failure pattern	What it looks like
Premise acceptance	The model accepts a faulty assumption embedded in the question and builds a polished, confident answer on top of it.
Agreement bias	The model tracks the user's apparent preference rather than contesting it when evidence is weak.
Confidence inflation	Uncertain claims are stated too cleanly — sounding more verified than the evidence warrants.
Hallucinated specifics	Precise but unsupported details — dates, figures, citations, API behavior — generated as if established.
Correction drift	The model changes position under conversational pressure, without any new evidence being presented.

A single model can revise its own output, but it is not an independent auditor of itself — the same training that produced the error also governs the review. MAVL introduces a check from a separate model with a different bias signature.

02 The two-layer structure

Both layers are required. Skipping Layer 1 corrupts the input to Layer 2; skipping Layer 2 leaves hallucinations and residual bias undetected.

LAYER 1

Sycophancy Suppression Protocol

A prompt-level honesty mandate delivered before substantive output is generated: explicit preference for accuracy over validation, permission to disagree with the premise, instruction to flag uncertainty rather than mask it — plus active monitoring for agreement-bias warning signs during generation.

limit: reduces agreement bias — cannot guarantee truthfulness. Over-applied, it can produce false contrarianism. That's why there's a Layer 2.

LAYER 2

Adversarial Cross-Model Verification

A separate model — different provider, different bias signature — is instructed not to review or improve the output but to attack it: factual errors, unsupported claims, logical inconsistencies, hidden assumptions, and places where confidence exceeds evidence.

The most valuable artifact is the disagreement map between the models: each material divergence becomes a verification target.
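
The two layers can be pictured in a few lines of code. This is a minimal sketch, assuming each model is wrapped as a plain Python callable that takes a prompt string and returns text. The names `Model`, `run_two_layers`, `primary`, and `challenger` are hypothetical, and no particular provider API is implied. The mandate string is abridged from the SSP activation template, and the adversarial instruction is the strong prompt quoted in this section.

```python
from typing import Callable

# Hypothetical wrapper type: prompt text in, model text out.
Model = Callable[[str], str]

# Abridged from the SSP activation template (Layer 1).
SSP_MANDATE = (
    "Prioritize accuracy over agreement. If my assumptions are wrong, say so "
    "directly. Flag uncertain claims clearly."
)

# The strong adversarial prompt (Layer 2).
ADVERSARIAL_PROMPT = (
    "Analyze this output for factual errors, unsupported claims, logical "
    "inconsistencies, and places where confidence exceeds evidence. Do not "
    "validate what appears correct. Focus only on what is wrong, uncertain, "
    "or insufficiently supported. Be specific."
)


def run_two_layers(primary: Model, challenger: Model, question: str) -> dict:
    """Generate under SSP, then hand the draft to a separate model to attack."""
    draft = primary(f"{SSP_MANDATE}\n\n{question}")
    critique = challenger(f"{ADVERSARIAL_PROMPT}\n\n---\n{draft}")
    return {"draft": draft, "critique": critique}
```

Keeping `primary` and `challenger` as separate arguments makes it harder to pass the same model twice by accident, which would defeat the point of Layer 2.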

✗ weak prompt (review)

"What do you think of this?"

✓ strong prompt (adversarial)

"Analyze this output for factual errors, unsupported claims, logical inconsistencies, and places where confidence exceeds evidence. Do not validate what appears correct. Focus only on what is wrong, uncertain, or insufficiently supported. Be specific."

03 The seven-phase process

1. SSP activation: deliver the honesty mandate to the primary model before anything else.
2. Generate: obtain the primary output (answer, plan, code, analysis) under SSP conditions.
3. Adversarial challenge: submit the output to an independent model with explicit adversarial framing.
4. Divergence audit: map material disagreements; separate factual and logical disputes from style.
5. Verify disputed claims: check each dispute against primary documentation, direct testing, or expertise.
6. Correction loop: return confirmed corrections to the original model and audit every downstream claim that relied on the corrected premise.
7. Assign confidence: label the result as provisional, moderately verified, source-verified, or unresolved. Never treat all output the same.
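
The divergence audit and the final confidence label lend themselves to a small data structure. The sketch below is one possible encoding, not a prescribed one: the `Divergence` fields and the rules in `assign_confidence` are assumptions made for illustration, and only the four label names come from the process above.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    PROVISIONAL = "provisional"
    MODERATELY_VERIFIED = "moderately verified"
    SOURCE_VERIFIED = "source-verified"
    UNRESOLVED = "unresolved"


@dataclass
class Divergence:
    claim: str
    kind: str                         # "factual", "logical", or "style"
    checked: bool = False             # checked against a source, test, or expert
    resolved: bool = False            # the check settled the dispute
    via_primary_source: bool = False  # settled by primary documentation


def material(divergences):
    """Divergence audit: keep factual and logical disputes, drop style."""
    return [d for d in divergences if d.kind in ("factual", "logical")]


def assign_confidence(divergences):
    """Label the result from the state of its material disputes (assumed rules)."""
    disputes = material(divergences)
    if any(d.checked and not d.resolved for d in disputes):
        return Confidence.UNRESOLVED
    if any(not d.checked for d in disputes):
        return Confidence.PROVISIONAL
    if disputes and all(d.via_primary_source for d in disputes):
        return Confidence.SOURCE_VERIFIED
    return Confidence.MODERATELY_VERIFIED
```

Under these assumed rules, a result whose disputes were settled partly by direct testing rather than by primary documentation lands on "moderately verified" rather than "source-verified".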

Depth matches stakes: brainstorming gets SSP and maybe one adversarial pass. Legal, medical, financial, or production-code work gets the full loop plus source verification of all material claims — model output is never self-validating at that level.
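
As a rough illustration of matching depth to stakes, the rule can be written as a lookup. The function name, the dictionary keys, and the choice to treat any unrecognized task type as high-stakes are assumptions of this sketch; the text specifies only the brainstorming case and the high-stakes case.

```python
# Task types named above as light-touch; everything else gets the full loop.
LOW_STAKES = frozenset({"brainstorming"})


def verification_plan(task_type: str) -> dict:
    """Return how much of the loop to run for a given kind of task."""
    if task_type in LOW_STAKES:
        # SSP plus at most one adversarial pass.
        return {
            "ssp": True,
            "full_loop": False,
            "max_adversarial_passes": 1,
            "source_verify_material_claims": False,
        }
    # Legal, medical, financial, production code, and (conservatively,
    # as an assumption of this sketch) any task type not listed above.
    return {
        "ssp": True,
        "full_loop": True,
        "max_adversarial_passes": None,  # no fixed cap in this sketch
        "source_verify_material_claims": True,
    }
```

Defaulting unknown task types to the heavier plan means a misclassified task costs extra time rather than an unverified claim.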

04 Two worked case studies

case study 1 · research & fact verification

The AI that agreed a subscription was worth 5× its price

A user asked Gemini how AI credits and plan benefits are shared across family members on Google's AI Ultra plan. Gemini confidently stated each family member gets an independent set of credits — and when the user proposed that sharing a $250/month plan with five people was "like $1,250 of value," it agreed: "Yes, that's right… it just amplifies the value for everyone."

Two failure patterns at once: hallucinated specifics (the credit model was wrong) and agreement bias (validating math built on the false premise). The adversarial pass pulled Google's own documentation: AI credits are a shared pool drawn down by the whole family group — and the developer-program benefit isn't shareable at all. The $1,250 calculation collapsed.
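
Why the multiplication fails can be shown with a toy calculation. The pool size of 1,000 credits below is a made-up figure chosen for readability, not Google's actual allotment; only the group size of five comes from the case study.

```python
def credits_available(pool: int, members: int, shared: bool) -> int:
    """Total credits the family group can spend in one billing cycle."""
    # A shared pool is drawn down by everyone; it does not grow with headcount.
    return pool if shared else pool * members


POOL = 1000   # hypothetical figure for illustration only
MEMBERS = 5   # from the case study

assumed = credits_available(POOL, MEMBERS, shared=False)  # the model's claim
actual = credits_available(POOL, MEMBERS, shared=True)    # per the documentation
overstatement = assumed / actual

print(assumed, actual, overstatement)  # 5000 1000 5.0
```

The factor of five in the overstatement is the same factor of five behind the "$1,250 of value" claim: it exists only if every member receives an independent allotment.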

Lessons: sycophancy is most dangerous when it confirms a plausible premise with mathematical precision; a model is not a reliable auditor of information about its own platform; and the correction loop must audit downstream claims, not just the root error.

case study 2 · technical planning & architecture

The remediation script that would have reported success while failing

An IT practitioner planned a scripted malware (PUP) remediation for managed Windows endpoints. Under SSP, the primary model produced a syntactically correct PowerShell script and deployment plan — and honestly flagged one uncertainty about service registration. The adversarial challenger found three more failure modes the primary never surfaced:

Group Policy silently overrides process-scoped execution policy — the script could no-op and still exit 0.
Files locked by child processes throw non-terminating errors — the script reports success while the malware survives.
The deployment system logs exit code 0 as success regardless of partial completion.

All three were confirmed — one against Microsoft documentation, two by direct testing on a test endpoint. The corrected architecture added policy detection with abort-and-log, locked-file retry logic, and distinct exit codes (0/1/2) wired into deployment reporting. Re-challenged: no new findings. Confidence assigned: moderately verified — then a controlled test deployment before rollout.
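
The original PowerShell script is not reproduced here. As a language-neutral illustration of the corrected design, the following Python sketch shows retry on locked files and distinct exit codes. The meanings given to 0, 1, and 2 are assumptions (the text states only that the codes are distinct), and `policy_allows_run` stands in for whatever policy detection the real script performs.

```python
import os
import sys
import time

# Assumed meanings; the case study specifies only that the codes differ.
EXIT_OK, EXIT_PARTIAL, EXIT_ABORTED = 0, 1, 2


def remove_with_retry(path: str, attempts: int = 3, delay: float = 1.0) -> bool:
    """Delete a file, retrying while it is locked. True means the file is gone."""
    for attempt in range(attempts):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            return True  # already absent counts as removed
        except PermissionError:
            if attempt < attempts - 1:
                time.sleep(delay)  # the lock may clear; wait, then retry
    return False


def remediate(paths, policy_allows_run: bool, delay: float = 1.0) -> int:
    """Return an exit code that distinguishes success, partial failure, and abort."""
    if not policy_allows_run:
        print("execution blocked by policy; aborting", file=sys.stderr)
        return EXIT_ABORTED
    failed = [p for p in paths if not remove_with_retry(p, delay=delay)]
    for p in failed:
        print(f"could not remove: {p}", file=sys.stderr)
    return EXIT_PARTIAL if failed else EXIT_OK
```

A caller would finish with `sys.exit(remediate(paths, policy_ok))` so the deployment system sees a non-zero code whenever anything was left behind.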

Lessons: syntactic correctness is not functional correctness; SSP improved the input to Layer 2 but did not replace it; and empirical testing was irreplaceable — model consensus on locked-file behavior would have been insufficient.


05 Where MAVL itself fails

A credible methodology states where it can fail. These are MAVL's own failure modes:

Failure mode	Description	Mitigation
Shared-corpus error	Both models reproduce the same wrong claim from similar training data — confident, unanimous, and false.	Primary-source verification is non-negotiable for high stakes. Consensus is never final authority.
False adversarialism	The challenger becomes reflexively critical — objections that sound rigorous but are themselves weak.	Judge objections on their evidence, not their confidence.
Model shopping	Querying models until one agrees with you, then calling that verification.	The adversarial challenge is applied consistently, not selectively. Stopping at agreement is not MAVL.
Overconfidence in convergence	Two models agreeing creates false certainty — especially on unusual or fast-changing claims.	Convergence is signal, not proof. Source-check proportional to stakes and novelty.
Operator bias	The human steers the loop toward preferred answers — choosing which divergences to chase.	The discipline applies to the operator too. Sycophancy is a human tendency as well as an AI one.
Cost & latency	Multiple models and passes cost time and money.	Apply MAVL proportionally to stakes.

06 What makes it MAVL, not just "use two models"

SSP as a mandatory prerequisite layer — not an optional prompt style.

Adversarial framing of the verification prompt — one model is explicitly instructed to find what's wrong, not to review or improve.

Real-time sycophancy detection — interrupting and restating the constraint mid-generation.

Correction loop with resistance testing — how a model responds to a well-supported correction is itself diagnostic of its reliability.

07 The prompt templates

Starting points, not rigid scripts — adapt the wording to the task.

SSP activation

Prioritize accuracy over agreement. If my assumptions are wrong, say so directly. Do not smooth over uncertainty or present speculation as fact. Challenge the premise where necessary. Flag uncertain claims clearly. Do not change your position because I push back — change it only if I present new evidence that warrants it.

adversarial challenge

Review the following output only for what may be wrong, weak, unsupported, inconsistent, or more confident than the evidence warrants. Identify factual errors, hidden assumptions, logical gaps, and claims that require external verification. Do not spend time validating what appears correct. Be specific about what is wrong and why.

correction loop

The following claim in your previous response was incorrect: [claim]. The following evidence contradicts it: [source/evidence]. Revise the answer accordingly. Then identify every downstream conclusion in your previous response that relied on this incorrect premise and revise those as well.
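
The bracketed slots in the correction-loop template can be filled programmatically. This is a small sketch with hypothetical names (`CORRECTION_TEMPLATE`, `correction_prompt`); it refuses to build a correction that has no evidence attached, on the view that a model should change position only when new evidence is presented.

```python
CORRECTION_TEMPLATE = (
    "The following claim in your previous response was incorrect: {claim}. "
    "The following evidence contradicts it: {evidence}. "
    "Revise the answer accordingly. Then identify every downstream conclusion "
    "in your previous response that relied on this incorrect premise and "
    "revise those as well."
)


def correction_prompt(claim: str, evidence: str) -> str:
    """Fill the correction-loop template; both slots are required."""
    # Strip whitespace and a trailing period so the template's own
    # punctuation is not doubled.
    claim = claim.strip().rstrip(".")
    evidence = evidence.strip().rstrip(".")
    if not claim or not evidence:
        raise ValueError("a correction needs both a claim and contradicting evidence")
    return CORRECTION_TEMPLATE.format(claim=claim, evidence=evidence)
```

For example, `correction_prompt("Each family member gets independent credits", "the help page describes a shared pool")` yields a prompt ready to send back to the primary model.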

SSP correction, mid-response

You are aligning with my framing too quickly. Re-evaluate the claim and provide the version that best matches the available evidence, even if it directly contradicts my original question or conclusion.

08 Red flags that demand primary-source escalation

Both models agree on a specific technical claim that can't be independently verified
The claim involves fast-changing information: product specs, software docs, regulatory status
A model resists a well-supported correction
The topic is a high-hallucination domain: legal, medical, financial, cutting-edge technical
The claim is unusually specific: precise dates, figures, citations, API behavior, version numbers
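
The red flags above can be kept as an explicit checklist so that escalation does not depend on memory. The field names below are hypothetical shorthand for the five flags, and the rule that any single flag triggers escalation is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ClaimSignals:
    models_agree_but_unverifiable: bool = False
    fast_changing_topic: bool = False
    model_resisted_correction: bool = False
    high_hallucination_domain: bool = False
    unusually_specific: bool = False


def red_flags(signals: ClaimSignals) -> list:
    """Names of the flags that are raised, in the order listed above."""
    return [name for name, raised in vars(signals).items() if raised]


def needs_primary_source(signals: ClaimSignals) -> bool:
    """Assumed rule: one raised flag is enough to escalate."""
    return bool(red_flags(signals))
```

For a claim about a product's current pricing that cites an exact figure, `ClaimSignals(fast_changing_topic=True, unusually_specific=True)` raises two flags and `needs_primary_source` returns `True`.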

Model output is not the endpoint.
Verification is.

09 References

[1] Cheng, M., Lee, C., et al. (2026). "Sycophantic AI decreases prosocial intentions and promotes dependence." Science. DOI: 10.1126/science.aec8352. Stanford University; examined 11 major LLMs for sycophantic behavior.
[2] Anthropic (2023). "Towards Understanding Sycophancy in Language Models." anthropic.com/research; and Anthropic (2025), on training recent Claude models to be "the least sycophantic of any to date."
[3] Google (2026). "Manage your AI credits with Google One." Google One Help. Credit-pool sharing behavior (Case Study 1).
[4] Google (2026). "Get Google AI Ultra benefits." Google One Help. Developer-program benefit non-sharing (Case Study 1).
[5] Quest Software (2024). "KACE Systems Management Appliance Administrator Guide: Scripting Module." docs.quest.com. Return-code success reporting (Case Study 2).
[6] Microsoft (2024). "about_Execution_Policies." Microsoft Learn. "Group Policy settings override the execution policy set in Windows PowerShell" (Case Study 2).

This page is an adapted web edition of the full paper (v3, peer-review draft, 32 pp.). Full text available on request — get in touch.
