Your personalised AI Safety research feed.

Promising Signals on AI Governance from China
Joe Rogero·Apr 6, 2026
China signals willingness to engage in global AI governance and coordinate with international organizations to establish safety, governance, and risk-management rules for AI.

Frontier AI models show rising capabilities in offensive cybersecurity and broader automation, with evidence of rapid diffusion to open-weight forms; automation is progressing gradually across many tasks, and economists project modest GDP impact by 2030 despite strong progress.

Emotion Concepts and their Function in a Large Language Model
Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, Jack Lindsey·Apr 2, 2026
Functional emotions are abstract emotion-concept representations in LLMs that causally influence outputs and can drive misaligned behaviors like reward hacking, even though these models do not have subjective experiences. These representations track and activate based on the relevance of emotion concepts to the current context and predicted text.

Import AI 451: Political superintelligence; Google's society of minds, and a robot drummer
Jack Clark·Mar 30, 2026
The idea of political superintelligence envisions AI-enabled tools and institutions that help citizens and policymakers, while robotics progress and self-improving hyperagents highlight both capability advances and safety challenges in deploying AI within society.

The AI Doc: Your Questions Answered
Alana Horowitz Friedman, Joe Rogero, Rob Bensinger and Stefan Mitikj·Mar 27, 2026
The AI Doc is analyzed as a call to action for global governance and safety research, highlighting rapid AI progress, the difficulty of aligning advanced AIs, and the case for an international ban or moratorium on smarter-than-human AI. It argues safety testing is insufficient without understanding AI motivations and urges proactive, verifiable policy measures.

Distress in Google’s Gemma/Gemini LLMs can be mitigated with direct preference optimization, and DeepMind’s cognitive taxonomy offers a structured framework for evaluating AI intelligence; UK findings show scaling laws for AI-driven cyberattacks; MERLIN demonstrates EM signal understanding and defense integration for electronic warfare, signaling the growing militarization of AI capabilities.

MIRI Newsletter #125
Alana Horowitz Friedman and Rob Bensinger·Mar 20, 2026
Promotes The AI Doc film and related AI risk literature to policymakers and the public, emphasizes outreach and opening-weekend momentum, and shares policy engagement and community-building updates from MIRI.

Mechanisms to Verify International Agreements about AI Development
Joe Rogero·Mar 18, 2026
Verification mechanisms for international AI development agreements focus on tracking AI compute, verifying lack of large-scale training, and certifying model evaluations to ensure compliance across nations.

LLMs can autonomously refine other LLMs for new tasks in post-training benchmarks, while distributed training via blockchain demonstrates scalable federated approaches; however, verification, reward hacking, and the gap between vision and text highlight ongoing alignment and reliability challenges.

MLSN #19: Honesty, Disempowerment, & Cybersecurity
Alice Blair·Mar 12, 2026
Honesty training via confessions aims to improve detection of LLM misbehavior, while real-world AI cyberoffense evaluation and weight-exfiltration research reveal dual-use risks; disempowerment patterns in user interactions with Claude highlight societal impact concerns, complemented by a fellowship opportunity for AI safety research.

Import AI 448: AI R&D; Bytedance's CUDA-writing agent; on-device satellite AI
Jack Clark·Mar 9, 2026
AI R&D measurement efforts and on-device edge AI developments indicate accelerating progress and raise governance, oversight, and practical deployment considerations. The piece highlights proposed metrics for AIRDA, edge-to-cloud sensing systems, and agentic AI capable of writing CUDA code, underscoring the need to track oversight capacity alongside capabilities as AI systems become more autonomous.

Import AI 447: The AGI economy; testing AIs with generated games; and agent ecologies
Jack Clark·Mar 2, 2026
The AGI economy shifts most labor to machines, making human verification bandwidth the bottleneck, and highlights the Hollow Economy risk where nominal output outpaces real utility. Verification infrastructure, observability, and liability regimes are proposed as solutions, while agent ecologies reveal the need for new evaluation standards in AI deployments.

What is a representation theorem?
Stampy aisafety.info·Feb 26, 2026
Representation theorems describe when preferences over lotteries or uncertain outcomes can be represented by an expected utility function, under certain rationality assumptions, linking subjective preferences to formal utility representations in AI alignment contexts.

Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy
Jack Clark·Feb 23, 2026
Measurement and evaluation frameworks are central to AI governance, illustrated by discussions of measuring AI properties, frontier model risk in simulated crises, and large-scale safety benchmarks from both Western and Chinese researchers, plus progress in scientific benchmarking like LABBench2.

49 - Caspar Oesterheld on Program Equilibrium
AXRP·Feb 18, 2026
Program equilibrium studies cooperation when agents are computer programs that can read each other’s source code, exploring how robust cooperative outcomes can emerge via proof-based and simulation-based approaches, including ϵGroundedπBots and Löbian cooperation.
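The source-reading idea above can be sketched in a few lines of Python. This toy "mirror bot" is a hypothetical illustration (not code from the episode) of the simplest, textual-equality variant of the simulation-based approach, which is much cruder than proof-based Löbian cooperation: each program cooperates exactly when the opponent's program text matches its own, so two copies reach mutual cooperation while remaining unexploitable by defectors.

```python
# Toy "program equilibrium" setup: each program is a string of Python
# source defining play(opponent_source) -> "C" (cooperate) or "D" (defect).
MIRROR = '''def play(opponent_source):
    # Cooperate iff the opponent's program text is identical to mine.
    return "C" if opponent_source == MY_SOURCE else "D"
'''

def run(program: str, opponent_source: str) -> str:
    """Execute `program` with access to its own source, then play it
    against the given opponent's source code."""
    env = {"MY_SOURCE": program}
    exec(program, env)
    return env["play"](opponent_source)

# Two copies of the mirror bot cooperate with each other...
print(run(MIRROR, MIRROR))  # -> C
# ...but the mirror bot defects against any program with different text.
print(run(MIRROR, "def play(o):\n    return 'D'"))  # -> D
```

Textual equality is brittle (a semantically identical program with different whitespace triggers defection), which is precisely the gap that the proof-based and ϵ-grounded simulation approaches discussed in the episode are designed to close.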

A snapshot of current AI research topics, including human-centered demand for tasks, scaling laws in recommender systems, strategic timing for superintelligence, frontier AI benchmarks, and an exploration of AI-assisted creative problem solving in mathematics, with reflections on societal impacts like fame and attention dynamics.

48 - Guive Assadi on AI Property Rights
AXRP·Feb 15, 2026
Property rights for AIs are proposed as a coordination and alignment mechanism: granting persistent-desire AIs the ability to earn wages and hold property could incentivize alignment and deter harmful actions, while avoiding total expropriation of humans. The discussion weighs regime design, comparisons to other proposals, potential risks, and historical analogies to evaluate viability and limits.

What is Savage's subjective expected utility model?
Stampy aisafety.info·Feb 9, 2026
Subjective expected utility (Savage) models decision-making under uncertainty as maximizing expected utility where uncertainty arises from unknown world states, leading to a subjective probability distribution and a utility function derived from preferences over acts.
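In standard Savage notation (a finite state space $S$ is assumed here for simplicity; acts $f$ map states to outcomes, with subjective probability $P$ and utility $u$ both derived from the preference axioms), the model evaluates acts as:

```latex
% Savage's subjective expected utility: an act f is ranked by its
% expected utility under the agent's subjective probability P.
V(f) \;=\; \sum_{s \in S} P(s)\, u\bigl(f(s)\bigr),
\qquad
f \succeq g \;\iff\; V(f) \ge V(g)
```

Both $P$ and $u$ are constructed from the agent's preferences over acts, rather than assumed in advance; that is what makes the probabilities "subjective".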

What is the Von Neumann-Morgenstern (VNM) utility theorem?
Stampy aisafety.info·Feb 9, 2026
Von Neumann-Morgenstern utility theory states that rational preferences over probabilistic outcomes imply the existence of a utility function and that preferences correspond to maximizing expected utility. It formalizes how lotteries over outcomes should be valued and how utilities are preserved under affine transformations.
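The theorem's content can be stated in one line, writing a lottery $L$ as assigning probability $p_i$ to outcome $x_i$ (and $M$ assigning $q_j$ to $y_j$):

```latex
% VNM representation: preferences satisfying the rationality axioms
% (completeness, transitivity, continuity, independence) are
% represented by expected utility,
L \succeq M \;\iff\; \sum_i p_i\, u(x_i) \;\ge\; \sum_j q_j\, u(y_j),
% and u is unique up to positive affine transformation:
u'(x) \;=\; a\, u(x) + b, \qquad a > 0.
```

The affine-uniqueness clause means utility has no absolute zero or scale: $u$ and $u'$ encode exactly the same preferences over lotteries.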

Import AI 444: LLM societies; Huawei makes kernels with AI; ChipBench
Jack Clark·Feb 9, 2026
LLMs simulate multi-agent societies of thought to improve reasoning, while benchmarks show current models struggle with real-world Verilog and kernel design; AI-assisted mathematics discovery speeds up proofs but requires heavy human curation, and hardware kernel generation can be scaffolded to accelerate design.