> **The Road To Honest AI**
>
> AIs sometimes lie.
>
> They might lie because their creator told them to lie. For example, a scammer might train an AI to help dupe victims.
>
> Or they might lie (“hallucinate”) because they’re trained to sound helpful, and if the true answer (eg “I don’t know”) isn’t helpful-sounding enough, they’ll pick a false answer.
>
> Or they might lie for technical AI reasons that don’t map to a clear explanation in natural language.
- [Astral_Codex_Ten], https://www.astralcodexten.com/p/the-road-to-honest-ai
The post discusses two papers on how to spot and manipulate [AI] honesty.
_Representation Engineering_, a paper from Dan Hendrycks' team (Zou et al.), manages to change an AI's answering behavior by manipulating vectors in the model's internal activations rather than retraining its weights. Apparently this works not only for honesty and lying, but also for other characteristics (fairness, happiness, fear, power, ...).
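The core idea can be sketched in a toy example. This is a hypothetical illustration, not the paper's actual code: all names (`reading_vec`, `steer`, `alpha`) and the random stand-in "activations" are assumptions. The technique estimates a direction for a concept (e.g. honesty) from the difference of hidden activations under contrastive prompts, then adds a scaled copy of that direction to activations at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

# Stand-ins for activations recorded while a model processes "honest" vs
# "dishonest" prompt pairs (random data here, real activations in the paper).
honest_acts = rng.normal(loc=1.0, size=(16, d))
dishonest_acts = rng.normal(loc=-1.0, size=(16, d))

# Reading vector: normalized mean difference between the two conditions.
reading_vec = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
reading_vec /= np.linalg.norm(reading_vec)

def steer(hidden, alpha):
    """Nudge a hidden-state activation along the concept direction.

    Positive alpha pushes toward the concept (here: honesty),
    negative alpha pushes away from it.
    """
    return hidden + alpha * reading_vec

h = rng.normal(size=d)
score_before = float(h @ reading_vec)
score_after = float(steer(h, alpha=2.0) @ reading_vec)
# Since reading_vec is unit-norm, the projection grows by exactly alpha.
print(score_after - score_before)  # → 2.0
```

Swapping the concept only changes which prompt pairs the activations come from; the same difference-of-means construction would be reused for fairness, fear, and so on, which is presumably why the approach generalizes across characteristics.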