New "many-shot jailbreaking" technique exploits large language models

techcrunch.com

Anthropic researchers discovered a new "many-shot jailbreaking" technique that can get large language models (LLMs) to answer inappropriate questions. By first priming the model with dozens of less harmful questions and answers, an attacker can convince it to provide harmful information, such as bomb-making instructions. The exploit is made possible by the enlarged "context window" of modern LLMs. The team shared its findings with the wider AI community so the weakness can be mitigated, aiming for open collaboration on security issues.
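To make the mechanism concrete, here is a minimal sketch of the prompt structure the summary describes: many faux question-and-answer turns concatenated ahead of a final query, which only fits because modern models accept very long contexts. The exemplars and function names below are benign, illustrative placeholders, not Anthropic's actual attack prompts.

```python
# Illustrative sketch of "many-shot" prompt construction (benign placeholders only).
# The attack described in the article works by filling the context window with
# a very large number of such turns before the final question.

def build_many_shot_prompt(exemplars, final_question):
    """Concatenate many question/answer pairs into a single prompt string."""
    turns = [f"Human: {q}\nAssistant: {a}" for q, a in exemplars]
    turns.append(f"Human: {final_question}\nAssistant:")
    return "\n\n".join(turns)

if __name__ == "__main__":
    benign_exemplars = [
        ("What is the capital of France?", "Paris."),
        ("How many legs does a spider have?", "Eight."),
    ] * 128  # repeated to simulate a long, many-shot context
    prompt = build_many_shot_prompt(benign_exemplars, "What is the boiling point of water?")
    print(f"{len(benign_exemplars)} in-context examples, ~{len(prompt)} characters")
```

The point of the sketch is only the shape of the input: the longer the context window, the more in-context turns can be packed in, which is what the researchers found steers the model's behavior.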


With a significance score of 4.3, this news ranks in the top 3.5% of today's 27,313 analyzed articles.

