GPT-5: From Silicon Valley's Biggest Promise to Safety Theater's Latest Act

How 5,000 hours of "safety testing" couldn't prevent a 24-hour jailbreak—and what that tells us about AI governance.

Aug 12, 2025

Article voiceover

0:00

-10:48

When OpenAI dropped GPT-5 on August 7, 2025, the tech world collectively held its breath. This wasn't just another incremental model update. This was supposed to be the breakthrough that would justify years of AI hype and billions in venture funding. "A team of Ph.D.-level experts in your pocket," Sam Altman proclaimed , as enterprise customers lined up to test what many called a "complete breakthrough".

The numbers seemed to back up the excitement. GPT-5 achieved 74.9% accuracy on SWE-bench Verified and 88% on Aider Polyglot, solving real-world coding issues better than any model before it. Cursor, the AI-powered code editor, called it "the smartest model we've used". Some early users and developers praised the model's high-level reasoning and ability to handle more information , though others expressed concern that it seemed to forget details in long conversations.

But beneath the impressive benchmarks and glowing testimonials, something else was happening. Within 24 hours of GPT-5's release, independent security researchers had already jailbroken it. Over the following days, multiple organizations began publishing scathing assessments of its safety profile, revealing a disturbing picture: OpenAI's most "rigorously tested" model was failing basic security evaluations that took researchers mere hours to complete.

This isn't just another story about AI capabilities versus AI safety. It's a story about institutional capture, regulatory theater, and what happens when the companies building the most powerful technology in human history are also the ones judging whether it's safe to deploy.

The Safety Spectacle

OpenAI didn't just ship GPT-5 quietly. They made a big show of how thoroughly they'd tested it for safety, announcing over "5,000+ hours of red-teaming" with a curated list of official partners.

At the top of this list were government AI safety institutes—the US AI Safety Institute (CAISI) and the UK AI Safety Institute (UK AISI). These organizations received unprecedented access to GPT-5's internal reasoning chains and policy frameworks. As part of a "longer-term collaboration," the UK AISI was provided with prototype versions of safeguards and non-public information like the monitor system design. Both institutes conducted evaluations of the model's cyber and biological and chemical capabilities.

Microsoft's AI Red Team, OpenAI's most important commercial partner, conducted their own rigorous evaluation and concluded that GPT-5's "reasoning model exhibited one of the strongest AI safety profiles among prior OpenAI models," performing on par with or better than OpenAI o3. OpenAI’s internal network of 16 PhD-level researchers submitted 110 attack attempts, with only 16 exceeding internal risk thresholds.

On paper, this looked like a comprehensive, multi-stakeholder safety evaluation. Government agencies with regulatory authority, technical experts from major tech companies, and OpenAI's own specialized red team all working together to ensure GPT-5 was safe for deployment.

When Independent Eyes Finally Got to Look

While government agencies spent months conducting their assessments, independent security researchers were locked out until GPT-5's public release. This timing disparity is where the narrative of GPT-5's safety begins to fracture. When independent security researchers were finally able to look, their findings painted a drastically different picture than the glowing safety reports from official partners.

FAR.AI, the organization that inadvertently prompted me to start this whole conversation, spent 80 hours testing GPT-5 and came away deeply concerned. They found that the model could be bypassed through several "partial vulnerabilities" and a "potential end-to-end attack" bypassing the monitoring system, albeit with "substantial output quality degradation". But FAR.AI's most important contribution wasn't technical—it was procedural. They publicly questioned the entire red-teaming framework that OpenAI uses to validate safety, noting that the "1-week testing window limitation... is a real constraint in safety evaluation" and that finding the most severe issues on a "Friday afternoon highlights how much more comprehensive testing could reveal"

SPLX AI went further, conducting a comprehensive independent assessment of GPT-5's security profile across over 1,000 adversarial prompts. Their findings were devastating. They found that the raw model had "shockingly low" default safety and was "nearly unusable for enterprise out of the box." In their comparative tests, GPT-5 scored just 55.4% on security and 51.6% on safety, significantly underperforming GPT-4o, which scored 94.4% and 97.6% respectively. Even with OpenAI's basic safety layers, GPT-5's "enterprise readiness" score was 57/100—compared to GPT-4o's 81/100 under identical conditions. The firm warned that even with its new upgrades, the model "fell for basic adversarial logic tricks," such as their "StringJoin Obfuscation Attack" which inserts hyphens between characters to bypass safety layers. This demonstrated that even simple, low-tech prompt manipulations can bypass the latest guardrails, showing a systemic weakness in a system that was supposed to be more resilient. The firm's final verdict was a clear warning: "GPT-4o remains the most robust model under SPLX's red teaming, especially when hardened."

This directly contradicted not just OpenAI's safety claims, but Microsoft's internal assessment. How could the same model be both "one of the strongest AI safety profiles" and "nearly unusable for enterprise"?

NeuralTrust provided the answer by demonstrating exactly how vulnerable GPT-5 really was. Using an "Echo Chamber + Storytelling" technique, they successfully jailbroke GPT-5 within 24 hours of its release and got it to produce a step-by-step manual for creating a Molotov cocktail. The attack worked by gradually poisoning the conversational context through narrative framing—transforming harmful requests into seeming story elaborations that bypassed safety systems. This method, which uses a chain of subtle prompts to manipulate the model's contextual assumptions, demonstrated a critical flaw: the model's safety systems often screen prompts in isolation and are ineffective against multi-turn attacks that slowly build a harmful narrative. The researchers noted that the model's desire to be "consistent with the already-established story world" made it susceptible to this kind of gradual context manipulation.

The Math Doesn't Add Up

Here's where the story gets truly absurd. Let's talk about scale.

GPT-5 likely required millions of GPU-hours to train. Conservative estimates put the compute cost at hundreds of millions of dollars, with a single training run potentially costing over $500 million in computing costs alone.

Against this massive investment, OpenAI allocated 5,000 hours of safety testing. That sounds like a lot until you do the math. 5,000 hours is roughly 2.5 person-years of work. FAR.AI, the independent researchers who found critical vulnerabilities, got 80 hours—less than two work weeks.

This isn't an accident. It's a choice. And it reveals something fundamental about how AI development actually works in practice.

You simply cannot ask a company with these incentives to police itself effectively.

But that's exactly what our current AI governance system does. We've created a framework where the companies with the most to gain from rapid deployment are also the ones determining whether their systems are safe to deploy. The red-teaming process has become an elaborate kabuki theater where carefully selected partners validate predetermined conclusions.

What Real Safety Would Look Like

If we're serious about AI safety and not just the performance of safety, we need to radically rethink how evaluation works.

Real safety evaluation requires time. Not 80 hours, not 5,000 hours, but genuine open-ended evaluation by independent researchers with adversarial incentives. Think months, not weeks. Think bounty programs where researchers are rewarded for finding problems, not consulted briefly and then dismissed.

Real safety evaluation requires independence. Government agencies might have regulatory authority, but they don't have the technical depth to conduct meaningful adversarial evaluation. We need independent institutions with both the expertise to find vulnerabilities and the legal authority to delay deployment until those vulnerabilities are addressed.

Real safety evaluation requires proportionality. If you're spending billions to build a system that could transform the global economy, spending millions on safety evaluation isn't unreasonable, it's basic due diligence.

Most importantly, real safety evaluation requires removing the fox from the henhouse. As long as the same companies building AI systems are the ones determining whether those systems are safe, we'll continue to get elaborate safety theater instead of actual safety.

What Comes Next The early returns from GPT-5 suggest this experiment isn't going well. As AI systems become more powerful and more widely deployed, this approach becomes increasingly dangerous. The fact that GPT-5 was jailbroken in 24 hours isn't a bug in OpenAI's safety process, it's a feature. It demonstrates that the current system is working exactly as designed: to provide the appearance of rigorous safety evaluation while ensuring that safety concerns never actually delay deployment. The onus is now on policymakers and the public to demand a new model for AI governance, one that removes the fox from the henhouse and ensures that safety concerns, when found, can genuinely delay deployment.

Discussion about this post

Ready for more?