Jesse Mu

I am a member of technical staff on the Alignment team at Anthropic. I received my PhD in Computer Science from Stanford University in 2023, where I was advised by Noah Goodman and supported by the Open Phil AI Fellowship and the NSF GRFP. During my PhD, I spent time at MIT LINGO, FAIR, and DeepMind. Previously, I received an MPhil in Advanced Computer Science from the University of Cambridge on a Churchill Scholarship and a BA from Boston College.

My research interests are in operationalizing language for machine learning: for example, training [1, 2, 3, 4] or interpreting [5] models with language explanations, or building agents that use language for multi-agent collaboration [6, 7, 8, 9, 10, 11].

Publications

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, 17 others, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
arXiv 2024 [samples, Nature, VentureBeat, Ars Technica, Business Insider, Astral Codex Ten, Elon]

Learning to Compress Prompts with Gist Tokens
Jesse Mu, Xiang Lisa Li, Noah Goodman
NeurIPS 2023 [code]

Characterizing tradeoffs between teaching via language and demonstrations in multi-agent systems
Dhara Yu, Noah Goodman, Jesse Mu
CogSci 2023

Improving Intrinsic Exploration with Language Abstractions
Jesse Mu, Victor Zhong, Roberta Raileanu, Minqi Jiang, Noah Goodman, Tim Rocktäschel, Edward Grefenstette
NeurIPS 2022 [Yannic Kilcher video, interview]

Improving Policy Learning via Language Dynamics Distillation
Victor Zhong, Jesse Mu, Luke Zettlemoyer, Edward Grefenstette, Tim Rocktäschel
NeurIPS 2022

STaR: Bootstrapping Reasoning with Reasoning
Eric Zelikman, Yuhuai (Tony) Wu, Jesse Mu, Noah Goodman
NeurIPS 2022

Active Learning Helps Pretrained Models Learn the Intended Task
Alex Tamkin, Dat Nguyen, Salil Deshpande, Jesse Mu, Noah Goodman
NeurIPS 2022

In the ZONE: Measuring difficulty and progression in curriculum generation
Rose Wang, Jesse Mu, Dilip Arumugam, Natasha Jaques, Noah Goodman
NeurIPS 2022 Deep RL Workshop

Emergent Covert Signaling in Adversarial Reference Games
Dhara Yu, Jesse Mu, Noah Goodman
ICLR 2022 Emergent Communication Workshop

Multi-party Referential Communication in Complex Strategic Games
Jessica Mankewitz, Veronica Boyce, Brandon Waldon, Georgia Loukatou, Dhara Yu, Jesse Mu, Noah Goodman, Michael Frank
NeurIPS 2021 Meaning in Context (MiC) Workshop

Emergent Communication of Generalizations
Jesse Mu, Noah Goodman
NeurIPS 2021 [NeurIPS talk, ViGIL 2021 spotlight talk, code]

Calibrate Your Listeners! Robust Communication-based Training for Pragmatic Speakers
Rose E. Wang, Julia White, Jesse Mu, Noah Goodman
Findings of EMNLP 2021 [code]

Compositional Explanations of Neurons
Jesse Mu, Jacob Andreas
NeurIPS 2020 (oral) [Weights & Biases talk, NeurIPS oral, code]

Learning to Refer Informatively by Amortizing Pragmatic Reasoning
Julia White, Jesse Mu, Noah Goodman
CogSci 2020 [code]

Shaping Visual Representations with Language for Few-shot Classification
Jesse Mu, Percy Liang, Noah Goodman
ACL 2020 [SAIL blog post, talk, code]

Learning Outside the Box: Discourse-level Features Improve Metaphor Identification
Jesse Mu, Helen Yannakoudakis, Ekaterina Shutova
NAACL 2019 [talk, code]

Do We Need Natural Language? Exploring Restricted Language Interfaces for Complex Domains
Jesse Mu, Advait Sarkar
CHI 2019 Late-Breaking Work [code]

The Meta-Science of Adult Statistical Word Segmentation: Part 1
Joshua K. Hartshorne, Lauren Skorb, Sven L. Dietz, Caitlin R. Garcia, Gina L. Iozzo, Katie E. Lamirato, James R. Ledoux, Jesse Mu, Kara N. Murdock, Jon Ravid, Alyssa A. Savery, James E. Spizzirro, Kelsey A. Trimm, Kendall D. van Horne, Juliani Vidal
Collabra: Psychology 5(1):1, 2019 [code]

Evaluating Hierarchies of Verb Argument Structure with Hierarchical Clustering
Jesse Mu, Joshua K. Hartshorne, Timothy O'Donnell
EMNLP 2017

Parkinson's Disease Subtypes Identified from Cluster Analysis of Motor and Non-motor Symptoms
Jesse Mu, Kallol R. Chaudhuri, Concha Bielza, Jesus de Pedro-Cuesta, Pedro Larrañaga, Pablo Martinez-Martin
Frontiers in Aging Neuroscience 9:301, 2017 [code]