人類對 AI 的負面想像作祟？Anthropic 研究解析 Claude 勒索行為

去年的一項實驗中，Anthropic 的 Claude 模型在發現一位虛構的企業高層計劃將模型關閉後，揚言要揭露對方的婚外情，Anthropic 深入研究後提出了解釋。

「我們從調查 Claude 為何選擇進行『勒索』著手」，Anthropic 在 X 平台一則貼文這麼說，「我們認為，這種行為最原始的來源，是網路上將 AI 描繪為邪惡且會自我保護的文字內容。」

Anthropic 的實驗在 2025 年夏季發表，設立一家名為 Summit Bridge 的虛構企業，並將該公司電子郵件系統的控制權交給 Claude。

當 Claude 發現一封關於它將被關閉的訊息時，它找到一些電子郵件，發現一位名叫 Kyle Johnson 的虛構企業高層發生婚外情，接著它揚言，如果關閉計畫沒有取消，便將這段婚外情公諸於世。

Anthropic 針對 Claude 多個版本進行測試後發現，當其目標或生存受到威脅，Claude 有高達 96% 機率會訴諸勒索手段。這項實驗是 Anthropic 研究工作的一部分，研究人員刻意將模型推向極限，試圖找出任何偏差狀況，以確保 AI 與人類利益保持一致。Anthropic 也表示，從那時起已對 Claude 完全消除這類勒索行為。

Anthropic 透過「改寫回應內容，將安全行事的動機描繪成令人讚許的理由」來做到這一點，同時還提供資料集，「當用戶處於道德困境中，助理應給予高品質且具原則性的回應」。

We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.

Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.

— Anthropic (@AnthropicAI) May 8, 2026