1 Comment
User's avatar
⭠ Return to thread
Alberto Romero's avatar

Very likely. Here's a paper from Anthropic on this: https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training. Worth a read, it's the exact security flaw you're talking about. I don't see LLMs being used in any high-stakes category where cybersecurity is a real concern any time soon (sleeper agents is just one of many different problems, like jailbreaking, prompt injection, and adversarial attacks, feel free to look those up as well).

Expand full comment