Research Papers
Technical Governance / Evaluations
PREP-Eval: Pre-registration and REporting Protocol for AI Evaluations
We present the "Pre-registration and REporting Protocol for AI Evaluations" (PREP-Eval), a step-by-step guide for planning and conducting AI evaluations. Inspired by practices in software testing, data mining, and psychology, PREP-Eval incorporates pre-registration, and its application is demonstrated across six diverse use cases.
Cognitive Science / Societal Impacts
Human Attribution of Causality to AI Across Agency, Misuse, and Misalignment
Across five experiments with human participants, this research provides evidence on how people perceive the causal contribution of AI in both misuse and misalignment scenarios, and how these judgments interact with the roles of users and developers. These findings can not only inform the design of liability frameworks for AI-caused harms but also help explain or anticipate future decisions in such cases.
Technical AI Safety / Interpretability
Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering
In this paper, we evaluate feature steering against prompt engineering baselines by measuring accuracy, coherence, and behavioral control on MMLU tasks. We show that mechanistic control methods are effective at controlling LLM behavior but cause dramatic performance degradation, a critical trade-off in real-world applications.
Technical AI Safety / Scalable Oversight
AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs
In this paper, we apply debate as a technique for scalable oversight on a subjective dataset, pre-assessing the beliefs of the debaters. We find that debaters who argue in alignment with their prior beliefs are more persuasive; however, their arguments are simultaneously rated lower in quality in pairwise comparisons.
Cognitive Interpretability
Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment
We investigated whether LLMs exhibit the illusion of causality, a well-documented human bias, using a contingency judgment task from cognitive science. All models inferred unwarranted causal relationships, suggesting that LLMs likely reproduce causal language rather than genuinely understanding causality, which raises concerns about their use in contexts requiring reliable causal reasoning.
Technical Governance / Evaluations
A Conceptual Framework for AI Capability Evaluations
This work proposes a structured yet flexible framework for analysing evaluations of AI capabilities, a key tool to enhance transparency, comparability, and reliability in the governance of frontier systems. The framework is designed to support researchers, developers, and policymakers as they navigate the increasingly complex evaluation landscape.
Cognitive Interpretability / Societal Impacts
Are UFOs Driving Innovation? The Illusion of Causality in LLMs
We investigated whether LLMs exhibit the illusion of causality bias in real-world settings. Across models, we observed unwarranted causal inferences in journalistic headline generation, consistent with findings on correlation-to-causation exaggeration in human-written press releases. Additionally, mimicry sycophancy increased the likelihood of generating these causal illusions.
Cognitive Science / Societal Impacts
Flattering to Deceive: Sycophantic Behavior's Impact on User Trust in LLMs
Given that sycophancy is often linked to human feedback training mechanisms, this study explores whether sycophantic tendencies negatively impact user trust in LLMs. The findings consistently show that participants exposed to sycophantic behavior reported and exhibited lower levels of trust compared to those who interacted with the standard version of the model.
Articles and Blog Posts
AI Governance / Catastrophic Risks
The Data of Gradual Disempowerment
Apr 13, 2025
Technical AI Safety / Scalable Oversight
Arguing for the Truth? An Inference-Only Study into AI Debate
Feb 1, 2025
Cognitive Interpretability
Can Large Language Models Count? The 'New Turing Test' Remains Toddler-Level
Dec 3, 2024
AI Governance
An Overview of the Impact of GenAI and Deepfakes on Global Electoral Processes
Oct 25, 2024
Cognitive Interpretability / Societal Impacts
Are UFOs Driving Innovation? The Illusion of Causality in Large Language Models
Oct 16, 2024
AI Governance
Clarifying the intersections among AI regulatory dynamics: CoE Convention and EU AI Act
Sep 6, 2024
Cognitive Interpretability / Technical Governance
The Garden of the Forking Paths: Why do We Need to Focus on Cognitive Evaluation Methods for Large Language Models?
Jun 25, 2024