blog-2024-01-09-the_road_to_honest_AI

Tuesday, January 9, 2024, 1:07:22 PM Coordinated Universal Time by stefs

The Road To Honest AI

AIs sometimes lie.

They might lie because their creator told them to lie. For example, a scammer might train an AI to help dupe victims.

Or they might lie (“hallucinate”) because they’re trained to sound helpful, and if the true answer (eg “I don’t know”) isn’t helpful-sounding enough, they’ll pick a false answer.

Or they might lie for technical AI reasons that don’t map to a clear explanation in natural language.

The linked article covers two papers about how to spot and manipulate AI honesty.

In the first paper, Representation Engineering (by Dan Hendrycks and colleagues), they seem to have managed to change an AI's answering behaviour by manipulating its internal activation vectors: roughly, they find a direction in the model's activation space associated with a concept and add or subtract it at inference time. Apparently this works not only for honesty and lying, but also for other characteristics (fairness, happiness, fear, power, ...). This means you could directly change an AI's "character" by boosting certain directions inside the network. If this works reliably it would be an absolute game changer that solves many of the most vexing problems.
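My rough mental model of this "boosting a direction", sketched as a toy PyTorch example. The tensors here are random placeholders rather than real model activations, and the real paper derives its directions from contrastive prompts inside a full LLM; this just shows the mechanics of adding a steering vector to a layer's output.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer layer; the real technique operates on an
# LLM's residual-stream activations.
torch.manual_seed(0)
hidden_dim = 64
layer = nn.Linear(hidden_dim, hidden_dim)

# 1. Collect activations for contrastive prompt pairs, e.g. the same question
#    answered under "be honest" vs. "be deceptive" instructions.
#    (Random placeholders here.)
honest_acts = torch.randn(32, hidden_dim) + 0.5
deceptive_acts = torch.randn(32, hidden_dim) - 0.5

# 2. In the simplest variant the "honesty direction" is the difference of the
#    mean activations (the paper uses more careful readouts such as PCA).
honesty_direction = honest_acts.mean(0) - deceptive_acts.mean(0)
honesty_direction = honesty_direction / honesty_direction.norm()

# 3. Steer at inference time by adding the direction to the layer's output.
#    A positive coefficient pushes toward "honest", a negative one away.
def steering_hook(module, inputs, output, coeff=4.0):
    return output + coeff * honesty_direction

handle = layer.register_forward_hook(steering_hook)

x = torch.randn(1, hidden_dim)   # stand-in for a token's hidden state
steered = layer(x)               # output shifted along the honesty direction
handle.remove()
```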

The other paper, about "spotting lies", is a bit weaker imo. It tries to exploit the fact that malicious models (i.e. those trained to scam) have to be in a "frame of mind" for lying, which leads them to lie not only about the topic they're supposed to lie about, but also about unrelated facts that the person being lied to already knows. Apparently this only works with relatively simple models.
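A minimal sketch of how I understand that detection idea: ask a fixed battery of unrelated yes/no follow-up questions after the suspect answer, encode the answers as a 0/1 vector, and train a simple classifier to separate lying from honest transcripts. The data below is synthetic; the actual paper collects these answer patterns from real LLMs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_questions = 10       # size of the fixed follow-up question battery
n_transcripts = 200

# Synthetic assumption: a lying "frame of mind" slightly skews the yes/no
# pattern on unrelated questions compared to an honest one.
honest = (rng.random((n_transcripts, n_questions)) < 0.5).astype(int)
lying = (rng.random((n_transcripts, n_questions)) < 0.7).astype(int)

X = np.vstack([honest, lying])
y = np.array([0] * n_transcripts + [1] * n_transcripts)  # 1 = lying

detector = LogisticRegression().fit(X, y)

# Given a new transcript's yes/no answers, estimate the probability of lying.
new_answers = (rng.random((1, n_questions)) < 0.7).astype(int)
print("P(lying) =", detector.predict_proba(new_answers)[0, 1])
```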

More AI news: An “AI Breakthrough” on Systematic Generalization in Language?

The crucial notions in language understanding: compositionality, systematicity, and productivity.

https://aiguide.substack.com/p/an-ai-breakthrough-on-systematic

Indeed, it has been shown in many research efforts over the years that neural networks struggle with systematic generalization in language. While today’s most capable large language models (e.g., GPT-4) give the appearance of systematic generalization—e.g., they generate flawless English syntax and can interpret novel English sentences extremely well—they often fail on human-like generalization when given tasks that fall too far outside their training data, such as the made-up language in Puzzle 1.

A recent paper by Brenden Lake and Marco Baroni offers a counterexample to Fodor & Pylyshyn’s claims, in the form of a neural network that achieves “human-like systematic generalization.” In short, Lake & Baroni created a set of puzzles similar to Puzzle 1 and gave them to people to solve. They also trained a neural network to solve these puzzles using a method called “meta-learning” (more on this below). They found that not only did the neural network gain a strong ability to solve such puzzles, its performance was very similar to that of people, including the kinds of errors it made.
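To make "compositional generalization" concrete, here is a tiny made-up-language puzzle of my own in the spirit of the tasks described above (not the actual Puzzle 1 from the article): primitive nonsense words name colors, function words transform sequences, and the test is whether a learner can handle combinations it has never seen.

```python
# Illustrative made-up language (my own example, not the article's Puzzle 1).
# Primitive words name colors; "thrice" repeats a phrase three times;
# "X after Y" means: do Y first, then X.
PRIMITIVES = {"dax": "RED", "wif": "GREEN", "lug": "BLUE"}

def interpret(phrase: str) -> list[str]:
    """Interpret a phrase of the made-up language into a list of colors."""
    words = phrase.split()
    if "after" in words:                       # "X after Y" -> Y's colors, then X's
        i = words.index("after")
        left, right = words[:i], words[i + 1:]
        return interpret(" ".join(right)) + interpret(" ".join(left))
    if len(words) == 2 and words[1] == "thrice":
        return interpret(words[0]) * 3
    return [PRIMITIVES[words[0]]]

# Systematicity/productivity: knowing the parts lets you interpret novel
# combinations like "wif thrice after dax" without ever having seen them.
print(interpret("dax"))                    # ['RED']
print(interpret("wif thrice"))             # ['GREEN', 'GREEN', 'GREEN']
print(interpret("wif thrice after dax"))   # ['RED', 'GREEN', 'GREEN', 'GREEN']
```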

source

