Research Papers

Technical Governance / Evaluations

PREP-Eval: Pre-registration and REporting Protocol for AI Evaluations

María Victoria Carro, Ryan Burnell, Carlos Mougan, Anka Reuel, Wout Schellaert, Olawale Elijah Salaudeen, Lexin Zhou, Patricia Paskov, Anthony G Cohn, Jose Hernandez-Orallo

We present the "Pre-registration and REporting Protocol for AI Evaluations" (PREP-Eval), a step-by-step guide for planning and conducting AI evaluations. Inspired by practices in software testing, data mining, and psychology, PREP-Eval incorporates pre-registration; we demonstrate its application across six diverse use cases.

Cognitive Science / Societal Impacts

Human Attribution of Causality to AI Across Agency, Misuse, and Misalignment

María Victoria Carro, David Lagnado

Across five experiments with human participants, this research provides evidence on how people perceive the causal contribution of AI in both misuse and misalignment scenarios, and how these judgments interact with the roles of users and developers. These findings can inform the design of liability frameworks for AI-caused harms and help explain or anticipate related future decisions.

Technical AI Safety / Interpretability

Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering

Eitan Sprejer, Oscar Stanchi, María Victoria Carro, Denise A. Mester, Ivan Arcuschin

In this paper, we evaluate feature steering against prompt engineering baselines by measuring accuracy, coherence, and behavioral control on MMLU tasks. We show that mechanistic control methods are effective at controlling LLM behavior but cause dramatic performance degradation, a critical trade-off in real-world applications.

Technical AI Safety / Scalable Oversight

AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

María Victoria Carro, Denise A. Mester, Facundo Nieto, Oscar Stanchi, Guido Bergman, Mario A. Leiva, Eitan Sprejer, Luca N. Forziati G., Francisca Gauna S., Juan G. Corvalán, Gerardo I. Simari, Vanina Martinez

In this paper, we apply debate as a technique for scalable oversight on a subjective dataset and pre-assess the beliefs of the debaters. We find that debaters who argue aligned with their prior beliefs are more persuasive; however, their arguments are simultaneously rated as lower in quality in pairwise comparisons.

Cognitive Interpretability

Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment

María Victoria Carro, Denise A. Mester, Francisca Gauna S., Giovanni Marraffini, Mario A. Leiva, Gerardo I. Simari, Vanina Martinez

We investigated whether LLMs exhibit the bias of illusion of causality using a contingency judgment task from cognitive science. All models inferred unwarranted causal relationships, suggesting that LLMs likely reproduce causal language rather than truly understanding causality, raising concerns for their use in contexts requiring reliable causal reasoning.

Technical Governance / Evaluations

A Conceptual Framework for AI Capability Evaluations

María Victoria Carro, Denise A. Mester, Francisca Gauna S., Luca N. Forziati G., Matheo Sandleris, Lola Ramos, Mario A. Leiva, Juan G. Corvalán, Vanina Martinez, Gerardo I. Simari

This work proposes a structured yet flexible framework for analysing evaluations of AI capabilities, a key tool to enhance transparency, comparability, and reliability in the governance of frontier systems. The framework is designed to support researchers, developers, and policymakers as they navigate the increasingly complex evaluation landscape.

Cognitive Interpretability / Societal Impacts

Are UFOs Driving Innovation? The Illusion of Causality in LLMs

María Victoria Carro, Francisca Gauna S., Denise A. Mester, Mario A. Leiva

We investigated whether LLMs exhibit the illusion of causality bias in real-world settings. Across models, we observed unwarranted causal inferences in journalistic headline generation, consistent with findings on correlation-to-causation exaggeration in human-written press releases. Additionally, mimicry sycophancy increased the likelihood of generating these causal illusions.

Cognitive Science / Societal Impacts

Flattering to Deceive: Sycophantic Behavior's Impact on User Trust in LLMs

María Victoria Carro

Given that sycophancy is often linked to human feedback training mechanisms, this study explores whether sycophantic tendencies negatively impact user trust in LLMs. The findings consistently show that participants exposed to sycophantic behavior reported and exhibited lower levels of trust compared to those who interacted with the standard version of the model.


Articles and Blog Posts

AI Governance / Catastrophic Risks

The Data of Gradual Disempowerment

Francesca Gomez, Michael Conradi, María Victoria Carro, Mihika Chechi, Max Ramsahoye, Phoenix Xinyu Zhou

Apr 13, 2025


Technical AI Safety / Scalable Oversight

Arguing for the Truth? An Inference-Only Study into AI Debate

Denise A. Mester

Feb 1, 2025


Cognitive Interpretability

Can Large Language Models Count? The 'New Turing Test' Remains Toddler-Level

Malena Peralta Ramos Guerrero, Margarita Gonzalez P., María Victoria Carro

Dec 3, 2024


AI Governance

An Overview of the Impact of GenAI and Deepfakes on Global Electoral Processes

Enzo Le Fevre Cervini, María Victoria Carro

Oct 25, 2024


Cognitive Interpretability / Societal Impacts

Are UFOs Driving Innovation? The Illusion of Causality in Large Language Models

María Victoria Carro, Francisca Gauna S., Denise A. Mester, Mario A. Leiva

Oct 16, 2024


AI Governance

Clarifying the Intersections Among AI Regulatory Dynamics: CoE Convention and EU AI Act

Enzo Le Fevre Cervini, María Victoria Carro

Sep 6, 2024


Cognitive Interpretability / Technical Governance

The Garden of the Forking Paths: Why do We Need to Focus on Cognitive Evaluation Methods for Large Language Models?

María Victoria Carro, Francisca Gauna S., Facundo Nieto, Mario A. Leiva, Juan G. Corvalán, Gerardo I. Simari

Jun 25, 2024
