Felix Hofstätter

AI safety researcher focused on model evaluations and elicitation

Joined March 2026

Summary

Technical AI-safety researcher focused on capability evaluations and latent-capability phenomena ("sandbagging") — develops and evaluates methods to reveal or mitigate hidden model capabilities. arxiv+2

Active contributor to open-source evaluation tooling and benchmarks for LLMs-as-agents, including implementation work on AgentBench and related inspect-evals integrations. github+1

Published researcher and communicator who writes both academic papers and accessible explainers on AI-safety topics, bridging technical research and broader community discussion. mlr+2

Engaged in the AI-alignment / longtermist community and recipient of targeted funding for independent safety research (Long-Term Future Fund), showing involvement in field-building and funded research projects. effectivealtruism+2

Work

Education

Projects

Writing

The Elicitation Game: Evaluating Capability Elicitation Techniques

January 1, 2025

Conference paper (PMLR/ICML) evaluating methods to elicit latent capabilities from models, introducing circuit-breaking and comparing prompting, activation steering, and fine-tuning.

proceedings.mlr.press

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

June 1, 2024

Paper studying 'sandbagging' — strategic underperformance by language models on evaluations — and methods to prompt, fine-tune, or password-lock models to hide or reveal capabilities.

arxiv.org

How Assistance Games make AI safer

October 1, 2022

Long-form article explaining Assistance Games (cooperative inverse reinforcement learning) as a framework for AI to learn human intentions via interaction; discusses benefits and limitations.

towardsdatascience.com

Felix Hofstätter

Summary

Work

Junior Research Scientist / Member of Technical StaffApollo Research2025–Present

Independent AI safety researcher (Long-Term Future Fund grant)Freelance (independent researcher)2023–2023

Researcher / contributor (AI Safety Hub labs involvement)AI Safety Hub / Labs (collaboration)2023–2023

Software ConsultantTNG Technology Consulting2021–2022

Research project on concurrent software verificationImperial College London (research project)2019–2019

Education

MEng Mathematics and Computer Science (first class), 70.3Imperial College London2017–2021

Projects

sandbagging-elicitation (GitHub)2024–Present

AgentBench / inspect_evals contributions2024–Present

Writing

The Elicitation Game: Evaluating Capability Elicitation Techniques

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

How Assistance Games make AI safer