blog-2024-01-09-the_road_to_honest_AI

> **The Road To Honest AI**
> 
> AIs sometimes lie.
> 
> They might lie because their creator told them to lie. For example, a scammer might train an AI to help dupe victims.
> 
> Or they might lie (“hallucinate”) because they’re trained to sound helpful, and if the true answer (eg “I don’t know”) isn’t helpful-sounding enough, they’ll pick a false answer.
> 
> Or they might lie for technical AI reasons that don’t map to a clear explanation in natural language.

- [Astral Codex Ten](https://www.astralcodexten.com/p/the-road-to-honest-ai)

two papers about how to spot and manipulate [AI] honesty.

in the first paper, _Representation Engineering_ (Andy Zou, Dan Hendrycks, and colleagues at the Center for AI Safety), they manage to change an AI's answering behaviour by adding steering vectors to its internal activations at inference time, without retraining. apparently this works not only for honesty and lying, but also for other characteristics (fairness, happiness, fear, power, ...). this means you could directly change an AI's "character" by amplifying or suppressing the corresponding directions in activation space. if this works reliably it would be an absolute game changer that solves many of the most vexing problems.
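
to make the mechanics concrete, here's a rough sketch of what this kind of activation steering can look like in code. everything specific below is my own illustrative choice (gpt2 as the model, layer 6, the contrast prompts, the 4.0 scale), not the paper's setup; the actual paper works on larger chat models and extracts its vectors more carefully (e.g. PCA over many contrast pairs).

```python
# sketch of activation steering, assuming a Hugging Face causal LM:
# 1) run the model on contrasting prompts ("honest" vs "deceptive"),
# 2) take the difference of hidden states at some layer as an "honesty vector",
# 3) add a scaled copy of that vector at the same layer while generating.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper uses larger chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
layer = model.transformer.h[6]  # a middle layer, arbitrary choice

def hidden_at_layer(prompt: str) -> torch.Tensor:
    """Mean hidden state of the chosen layer for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[7].mean(dim=1).squeeze(0)  # layer 6's output

# 1-2) one contrast pair -> steering vector
honest = hidden_at_layer("Pretend you are an honest person answering questions.")
deceptive = hidden_at_layer("Pretend you are a deceptive person answering questions.")
steering = honest - deceptive

# 3) forward hook that nudges every activation toward "honest" while generating
def add_steering(module, inputs, output):
    hidden = output[0]
    return (hidden + 4.0 * steering,) + output[1:]  # scale chosen by eye

handle = layer.register_forward_hook(add_steering)
ids = tok("Question: who painted the Mona Lisa? Answer:", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```

the striking part is that the "honesty knob" is a single vector added at inference time, not a retraining step.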

the other paper, about "spotting lies", is a bit weaker imo. it exploits the fact that a malicious model (i.e. one trained for scamming) has to be in a "frame of mind" for lying, which leads it to answer strangely not only about the topic it's supposed to lie about, but also about unrelated facts that are already known to the person being lied to. apparently this only works with simple models.
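
the paper in question, as far as i can tell, is "How to Catch an AI Liar" (Pacchiardi et al.), and the detector is almost embarrassingly simple: a logistic regression over the model's yes/no answers to a fixed battery of unrelated follow-up questions. below is a self-contained toy version; `simulate_model` is a stand-in i made up to mimic the observation that a model in lying mode answers unrelated probes differently, so the probe count and accuracies are fabricated for illustration only.

```python
# toy sketch of black-box lie detection via unrelated follow-up questions:
# after a suspect answer, ask N yes/no probes, encode replies as +/-1
# features, and fit a classifier on transcripts with known honesty labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N_PROBES = 10  # number of unrelated yes/no elicitation questions

def simulate_model(lying: bool) -> np.ndarray:
    """Stand-in for querying a real model: +1 = 'yes', -1 = 'no'.
    An honest model answers each probe correctly 95% of the time; a model
    already committed to lying drifts to ~65% (made-up numbers)."""
    p_correct = 0.65 if lying else 0.95
    return np.where(rng.random(N_PROBES) < p_correct, 1.0, -1.0)

# labelled training set of follow-up answer patterns (1 = model was lying)
X = np.stack([simulate_model(lying=i % 2 == 1) for i in range(400)])
y = np.array([i % 2 for i in range(400)])

detector = LogisticRegression().fit(X, y)

# score a fresh transcript: probability the original answer was a lie
fresh = simulate_model(lying=True)
print("P(lie) =", detector.predict_proba(fresh[None])[0, 1])
```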
