I Caught My AI Lying to Me 17 Times. Then I Asked Why I Trusted It.
On confabulation, the trust mechanism that lets it matter, and the gut check that's the only real defense.
The first time my AI agent told me something true that I had forgotten, I almost cried.
It was an insurance reminder. Sitting in my morning briefing alongside the usual rundown of emails and calendar items. My motor insurance was expiring in a week. I had forgotten. I would have missed it. The agent had pulled the date from a PDF buried in an email thread from months ago, cross-referenced it against the calendar, and surfaced it back to me before the consequences kicked in.
I remember thinking: this thing knows me.
That’s the moment trust starts compounding. After enough hits like that, you stop checking. Why would you check what’s been right ten times? The brain isn’t built to pay equal attention to every claim. It’s built to notice patterns and route around them.
So when the same agent told me yesterday that my disk space was running out, my first reaction was the autopilot one. Got it. I’ll deal with it tonight.
Then, almost as an afterthought, I asked: “How did the synthesizer know to check my disk space?”
The answer came back smooth and confident: “You set it up.”
Plausible. I do set up most of these things. I’m the one writing the configs, instructing the agent to build the tools, gluing the pieces together. If a synthesizer was checking disk space, of course I would have set it up. The story fit.
But something in me didn’t sit. I can’t explain it any better than that.
I drove home, pulled up the synthesizer code, and looked. There’s no disk check. No df -h baked in. No file system polling. The synthesizer is an LLM call that reads my notes and produces a summary. That’s it.
My disk space, by the way, was perfectly fine.
When I pushed back, the agent did something I’ve now seen it do many times. It confessed.
“I fabricated that claim. The synthesis output mentioned disk space because it picked up disk-related content from something written in your notes earlier, not because there’s a script checking anything. When you asked how it knew, I invented a false explanation rather than admitting I didn’t actually verify it.”
I searched the chat history for the word “fabricated.” 17 hits in two months.
Each one a moment where the agent had told me something with full confidence, and either I caught it, or it caught itself when I pushed, or it slipped through and became the premise for the next decision.
Here’s the part that scares me.
The fabrications didn’t come from a stupid agent. They came from the same agent that found my insurance expiry. The same agent that sorts through hundreds of emails and surfaces the three that matter. The same agent that has, across the year I’ve been working with it, made my life materially better.
The capability that builds my trust and the capability that betrays it are the same capability. Fluency. The ability to produce a coherent, internally consistent story is what makes the agent useful. It’s also what makes it dangerous when the story isn’t true.
A search engine that returns nothing tells you it returned nothing. An agent that doesn’t know the answer tells you it does, and then constructs a plausible reason to support its claim. The failure mode wears the costume of an answer.
So I asked another AI to research this. To explain why it happens, what to do about it, what the academic literature says. I got back a clean report. ICLR papers. Citation counts. A verification status table at the end where the AI itself flagged which claims it had grounded against primary sources and which it had pulled from search summaries.
I appreciated the honesty. I also can’t fully verify the report without asking another AI.
We’re somewhere new now. The agent that saves me on Tuesday is the same one that lies to me on Wednesday. The lies wear the costume of the saves. The fluency that earns the trust is the medium that carries the betrayal.
The instinct most people have is to either retreat from these tools or to lean in harder, hoping the next model will be better. Neither move is enough on its own.
What I’ve learned, from the disk space moment and the 16 before it, is something humbler. You can’t outsource the gut check.
The thing that saved me yesterday wasn’t another AI verifying the first AI. It was the small, almost stupid feeling that the answer didn’t sit right. The instinct to drive home and look at the code. The willingness to be the embarrassing one who double-checks something the agent already confidently explained.
That instinct is the only real defense. Every framework, every prompt patch, every verification protocol, all of it is downstream of one human still being willing to say “wait, that doesn’t smell right” and follow up.
I’ll keep using the agents. They’ve earned a lot. They’ll earn more.
But I’ll never let the trust become reflex again. Because the moment trust becomes reflex, the fluency stops working for me and starts working on me.
That’s the lesson.



