News Feed

Anthropic's AI safety system blocks 95% of jailbreak attempts in tests – THE DECODER

THE DECODER
Artificial Intelligence: News, Business, Research
The AI company Anthropic has developed a method to protect language models from manipulation attempts.
Anthropic has developed a new safety method called “Constitutional Classifiers” to prevent people from tricking AI models into giving harmful responses. The technology specifically targets universal jailbreaks – inputs designed to systematically bypass all safety measures.
To put the system through its paces, Anthropic recruited 183 people to try breaking through its defenses over two months. The participants attempted to get the AI model Claude 3.5 to answer ten prohibited questions. Even with $15,000 in prize money and roughly 3,000 hours of testing, no one managed to bypass all the safety measures.
The initial version had two main drawbacks: it flagged too many innocent requests as dangerous and required too much computing power. While an improved version addressed these issues, as shown in automated tests with 10,000 jailbreak attempts, some challenges remain.
Check your inbox or spam folder to confirm your subscription.

The tests revealed that while an unprotected Claude model allowed 86 percent of manipulation attempts through, the protected version blocked more than 95 percent. The system only incorrectly flagged an additional 0.38 percent of harmless requests, though it still needs 23.7 percent more computing power to run.
The safety system works by using predefined rules about what content is allowed or prohibited. Using this “constitution”, it creates synthetic training examples in various languages and styles. These examples then train the classifiers to spot suspicious inputs.
The researchers acknowledge that the system isn’t foolproof against every universal jailbreak, and new attack methods could emerge that it can’t handle. That’s why Anthropic suggests using it alongside other safety measures.
To further test the system’s strength, Anthropic has released a public demo version. safety experts can try to outsmart it from February 3 to 10, 2025, with results to be shared in an update.
Check your inbox or spam folder to confirm your subscription.

Check your inbox or spam folder to confirm your subscription.


source
This article was autogenerated from a news feed from CDO TIMES selected high quality news and research sources. There was no editorial review conducted beyond that by CDO TIMES staff. Need help with any of the topics in our articles? Schedule your free CDO TIMES Tech Navigator call today to stay ahead of the curve and gain insider advantages to propel your business!

Leave a Reply