
Felix Hofstätter
AI safety researcher focused on model evaluations and elicitation
Joined March 2026
Summary
Technical AI-safety researcher focused on capability evaluations and latent-capability phenomena ("sandbagging") — develops and evaluates methods to reveal or mitigate hidden model capabilities. arxiv+2
Active contributor to open-source evaluation tooling and benchmarks for LLMs-as-agents, including implementation work on AgentBench and related inspect-evals integrations. github+1
Published researcher and communicator who writes both academic papers and accessible explainers on AI-safety topics, bridging technical research and broader community discussion. mlr+2
Engaged in the AI-alignment / longtermist community and recipient of targeted funding for independent safety research (Long-Term Future Fund), showing involvement in field-building and funded research projects. effectivealtruism+2
Work
Education
Projects
Writing
The Elicitation Game: Evaluating Capability Elicitation Techniques
January 1, 2025Conference paper (PMLR/ICML) evaluating methods to elicit latent capabilities from models, introducing circuit-breaking and comparing prompting, activation steering, and fine-tuning.
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
June 1, 2024Paper studying 'sandbagging' — strategic underperformance by language models on evaluations — and methods to prompt, fine-tune, or password-lock models to hide or reveal capabilities.
How Assistance Games make AI safer
October 1, 2022Long-form article explaining Assistance Games (cooperative inverse reinforcement learning) as a framework for AI to learn human intentions via interaction; discusses benefits and limitations.