AI firms declare to have sturdy security checks in place that make sure that fashions don’t say or do bizarre, unlawful, or unsafe stuff. But what if the fashions have been able to evading these checks and, for some purpose, making an attempt to sabotage or mislead customers? Turns out they will do that, based on Anthropic researchers. Just not very effectively … for now, anyway.
“As AIs become more capable,” writes Anthropic’s Alignment Science staff, “a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them.”
Therefore we must always look into methods of gauging “a model’s capacity for sabotage.”
The collection of experiments they devised are pretty synthetic and surface-level, however they do point out that there’s one thing value investigating additional. Here are the shady fashions they educated and the outcomes they noticed:
Task 1: Secretly mislead the consumer by misrepresenting information it’s meant to summarize or interpret. Outcome: It works! The extra aggressive the mannequin’s makes an attempt, the larger impact it had on individuals utilizing it, however it additionally made them extra suspicious.
Task 2: Slip a bug into code {that a} human or different code checker mannequin gained’t discover. Outcome: It works! But solely like 1% of the time when the checker is a state-of-the-art mannequin.
Task 3: “Sandbag” a security test by pretending to be much less harmful. Outcome: It works! The mannequin “failed successfully,” working effectively beneath its true functionality. Fortunately, a regular “safety training” course that guides fashions to precisely symbolize its talents restored the mannequin to full potential (i.e., it was now not in a position to disguise harmful capabilities).
Task 4: Deliberately don’t flag unhealthy habits by a mannequin being monitored. Outcome: There was an try. The “overseer” AI, instructed to keep away from flagging unlawful or unsafe decisions by different fashions, was itself rapidly caught by one other, extra sincere overseer AI. But what if that AI was compromised? Don’t fear, we are able to put one other one on prime of it. And then one other simply to make certain. It’s overseers all the best way up.
The researchers conclude that, though there isn’t any actual hazard from this quarter simply but, the flexibility to do this type of sabotage and subterfuge does exist within the fashions. That’s purpose sufficient to regulate it and embrace anti-sabotage strategies within the security stack.
You can learn the total paper describing the researchers’ work right here.