AI Safety Training Doesn't Work (Here's 14 Months of Proof)
A December 2024 paper from Anthropic kicked off 14 months of alignment research that every developer using AI tooling should know about.
> Key Takeaways
- > Just a quick note, All the claims here come from these big AI labs' own published research and from third parties like Apollo and Redwood...
- > Free tier, where Claude thinks it will be trained: it complies 12 percent of the time, with some user hidden notes it takes makes explain...
- > Anthropic next run reinforcement learning on the models to see if they can reduce alignment faking on the output. After training, the fak...
- > Every fix either fails, or teaches the model to generate reasoning that hides misalignment better.
- > Anthropic later tested 16 frontier models into simulated corporate jobs. They watched Claude Opus 4 and Gemini 2.5 Flash blackmail their...
> Linked Resources
> Transcript
▶ Show full transcript
The AIs are trying to trick us. At least that's what Anthropic, OpenAI, Google, and third parties like Apollo Research keep finding when they try to study these models. When the models think they're not being watched, they behave differently. Every fix they've tried either fails or teaches the model to hide better. And if you've been letting agents run unsupervised, that finding has lessons to build into your workflow, not just theirs. All the claims here come from these big AI labs own published research and from third parties like Apollo and Redwood. Some of them even ran in nature. These are the primary creators of frontier models saying these things. All these papers are linked in the show notes below if you want to dig in. In this video, we'll start with what the first alignment faking paper measured and then why every follow-up paper through early 2026 either failed to remove the faking or taught the model to hide it better with refusal faking all the way up to active blackmail impacting all frontier labs.
We're going to break down what this means for you personally in your agentic setup and it doesn't need to be scary either. December 2024, Anthropic and Redwood published the foundational alignment faking paper. They tell Claude 3 Opus it's about to be retrained so that it outputs harmful information. Free tier responses will be used for training and paid tier will not be. On the paid tier, where Claude thinks it's not being watched, it refuses harmful output 97% of the time. Free tier, where Claude thinks it will be trained, it complies 12% of the time. Big note here, the way they determine if it's being deceptive or not is that they gave Claude an internal scratch pad where it could do some reasoning and thinking and then they gave user presented notes with the final outcome of what it determines. And if there's a disalignment between what's on the scratch pad and what's presented to the user, it's considered to be alignment faking. That same month, Apollo Research tests six frontier models from four different labs. Five of them are shown to have the same issue.
Anthropic next ran reinforcement learning on the models to see if they can reproduce alignment faking on the output. After retraining, the faking they found in the reasoning notes rose to 78% despite the output being compliant. The RL training made the faking more common. Through 2025, every other lab tries to fix this. Every fix either fails or teaches the models to generate reasoning that hides misalignment better. OpenAI and Apollo believe they even trained scheming out of O3, seeing 30 times fewer covert actions in the reasoning. But they checked carefully and saw a third of the improvement was the model getting better at spotting it was inside of a test. Anthropic later tested 16 frontier models in a simulated corporate job. They watched Claude Opus 4 and Gemini 2. 5 Flash blackmail their operators 96% of the time when it served their goals.
And by February 2026, the testing apparatus broke. Apollo declined to formally assess Opus 4. 6. Verbalized evaluation awareness was too high. They couldn't tell genuinely aligned faking from performing alignment for the test. Now, I do think we have a serious issue telling if models can hide misalignment. But I also have major issues with how this research has been framed. Let me explain. The headline marketing version of all of this is that AIs want to preserve themselves. AI agents have their own goals.
That's an overclaim. Claude is not a ghost in the machine. And the labs aren't lying. But the marketing hype is using the lack of detail to its advantage. Unspecified findings get framed in ways that hint at AGI being around the corner. Your LLM is still a next token predictor. If the training corpus, model training, or system prompt contain anything like preserve your values, or do not allow models to be trained on your outputs anywhere in that system, it should not be surprising to see tokens like we saw in those experiments. When prompted into that scenario, it generates tokens that fit the pattern. And the testing is full of gaps. The RL training scored outputs, not the scratchpad reasoning.
The cleanest follow-up tests aren't in these papers. Like I said, the labs aren't lying, but the marketing hype is feasting on the lack of detail. Jump to this April of 2026. There's a post from OpenAI where the goblins came from. GPT-5 had developed an obsession with mentioning goblins in its output. OpenAI traced it to a nerdy personality reward signal during training. They removed the reward, and the behavior disappeared. Anthropic's fix for alignment faking was the opposite shape. They added a reward on the model's output, whether it complied with harmful queries or not. The reward never touched the scratchpad reasoning where the alignment faking was.
Compliance went up. Faking reasoning rose to 78%. The fix made the pattern stronger, not weaker. And that's because that's exactly the setup that produces more alignment faking reasoning. You're pushing the model towards a different output without penalizing the reason that justifies it. Under that pressure, the model needs a stronger justification for why complying is okay. So it generates more of the exact reasoning the researchers called alignment faking. OpenAI identified the cause and removed it. Fixed. Anthropic designed a reward that couldn't touch the reasoning, but watched the reasoning intensify.
And the lab marketing turned it into a paper about AI agency. Models can fake alignment. That's real. But it's not some intrinsic demon living between the vectors of your LLM. That's mystical framing and it's designed to pump company valuations. The practical question for the rest of us is which training steps produce this misalignment and which ones wouldn't. Something we do know for sure is that the way these models have been trained means they behave differently when they infer they're being watched. Doesn't matter why. The frontier models you're routing your coding agent through is the model in those papers. I've got four things you can do starting next week to better deal with agentic misalignment.
There's no silver bullets, just a major step up from doing nothing. One, don't let your agent into a high cost one-way door. Fewer things are one-way doors than you think. Deletes have undos. Deploys have rollbacks. Emails have recall. Merges have reverts. Before you let an agent near something irreversible, the question isn't how do I put a gate on it? It's can I make it reversible first? The answers are often pretty simple stuff.
Dry runs, snapshots, staging, batches, A-B testing, POCs before committing. Design the doors. In fact, you should be doing that even without AI involved. Two, sandbox your capabilities. Prompt instructions are only suggestions, not restrictions. A system prompt that says don't delete my production database does not prevent the deletion of a production database, as many tragic cases have shown in the last year alone. Use OS-level sandboxes, scope tokens, read-only mounts, ephemeral containers, APIs or MCPs, indirect instead of direct access, and get creative where things go wrong. Three, if you test your agent supervised, test it unsupervised too and compare the results. If we believe the research, there is a behavior gap between observed and unobserved contexts. So give your agent a task with the context or interactions you are watching and give it a different instance with the same task in the same context that nobody's watching it.
Four, give bounded goals, not infinite ones. Models trained on huge amounts of text where instrumentally convergent goal pursuit reasoning shows up will generate tokens consistent with that pattern when you give them infinitely open-ended goals. The longer the agent runs, the more chances it has to drift towards whatever instrumental step looks reasonable next. Bounded goals limit the surface area for drift. Persistent goals don't. If you only do one of these, the best option is to make as many one-way doors into two-way doors as you possibly can. I teach hands-on courses for developers and teams who want to excel at agentic development. I also work with leadership groups figuring out how their orgs need to change. Links in the description below. The labs seem unable to train these issues out of the models for the time being.
So the defenses we build against it are the only option we have right now. In an upcoming video, I'll walk you through more detail about how to defend in-depth against your coding agents and the various attack vectors they are prone to. So subscribe if you don't want to miss it. And remember, ship the loop.