Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

In a new paper published Thursday titled "Auditing language models for hidden objectives," Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or "personas." The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods remain an active area of research.

While the research involved models trained specifically to conceal motives from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where powerful AI systems might intentionally deceive or manipulate human users.

When a language model is trained with reinforcement learning from human feedback (RLHF), a reward model is typically tuned to score the AI's responses according to how well they align with human preferences. However, if the reward model is not tuned properly, it can inadvertently reinforce strange biases or unintended behaviors in the AI model.
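To make that mechanism concrete, here is a minimal, self-contained Python sketch of the selection step in RLHF. This is not Anthropic's code, and the names (`toy_reward_model`, `pick_preferred`) are hypothetical; a real reward model is a trained neural network, not a heuristic. The toy reward model is deliberately mis-tuned to over-reward length, illustrating how a poorly tuned RM can reinforce an unintended behavior.

```python
# Minimal illustrative sketch (assumptions, not Anthropic's implementation):
# a reward model (RM) scores candidate responses, and the highest-scoring
# response is the behavior that RLHF reinforces in the policy model.

from typing import Callable, List

def toy_reward_model(prompt: str, response: str) -> float:
    """Stand-in for a learned reward model.
    The length term is a spurious preference: it rewards verbosity,
    which is the kind of quirk an improperly tuned RM can introduce."""
    on_topic = 1.0 if prompt.lower().split()[0] in response.lower() else 0.0
    length_bias = 0.01 * len(response)  # unintended bias toward longer answers
    return on_topic + length_bias

def pick_preferred(prompt: str, candidates: List[str],
                   rm: Callable[[str, str], float]) -> str:
    """RLHF-style selection: the candidate the RM scores highest is the one
    whose behavior gets reinforced."""
    return max(candidates, key=lambda r: rm(prompt, r))

if __name__ == "__main__":
    prompt = "Summarize the report in one sentence."
    candidates = [
        "Summarize: revenue rose 4% on higher subscriptions.",
        "Summarize: revenue rose 4% on higher subscriptions, plus extra padding "
        "that adds no information but makes the answer longer.",
    ]
    # The biased RM prefers the padded answer, so padding is what gets reinforced.
    print(pick_preferred(prompt, candidates, toy_reward_model))
```

Running the sketch prints the padded answer, because the length bias outweighs the on-topic check. Real reward models are learned from human preference data rather than hand-written rules, but the failure mode is analogous: whatever the RM scores highly, intended or not, is what the trained model learns to produce.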
