Zhengxuan Wu

zen's blog

my core dump about interpretability, language models, and other stuff.

Blog posts

May 27, 2025
Representation steering is a powerful tool for understanding and controlling the behavior of language models. In this post, I share lessons learned from our recent work on training a better representation steering method with a preference-based training objective.
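The core idea behind representation steering can be sketched in a few lines: add a scaled direction to a hidden activation at some layer. This is a minimal illustration, not the method from the post; the function name, shapes, and the choice of a fixed strength `alpha` are all illustrative assumptions.

```python
import numpy as np

def steer(hidden, direction, alpha=4.0):
    """Add a scaled, unit-norm steering direction to a hidden activation.

    hidden:    (d,) activation vector from some layer (hypothetical input)
    direction: (d,) steering vector, e.g. a learned or contrastive direction
    alpha:     steering strength (hyperparameter)
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
h = rng.normal(size=8)
v = rng.normal(size=8)
h_steered = steer(h, v, alpha=2.0)
print(h_steered.shape)  # (8,)
```

In practice the direction would be learned or extracted from model activations rather than sampled at random, and the intervention would be hooked into a forward pass.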
April 05, 2024
Representation finetuning (ReFT) is a novel approach to parameter-efficient, powerful, and interpretable fine-tuning of language models. It draws inspiration from our interpretability work on distributed alignment search (DAS). Instead of training any model weights, we train interventions that edit representations on the fly. We show that editing a very small number of representations is enough to match or approach state-of-the-art (SoTA) performance across a wide range of tasks.
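To make "train interventions that edit representations" concrete, here is a rough numpy sketch of a low-rank intervention in the spirit of LoReFT, which edits a hidden state only within a rank-r subspace. Shapes, names, and the random parameters are illustrative assumptions; in the actual method R, W, and b are trained while the model weights stay frozen.

```python
import numpy as np

def loreft(h, R, W, b):
    """Low-rank representation intervention, LoReFT-style sketch:
    Phi(h) = h + R^T (W h + b - R h),
    where R (r x d) has orthonormal rows, so the edit is confined to
    the rank-r subspace spanned by those rows.
    """
    return h + R.T @ (W @ h + b - R @ h)

rng = np.random.default_rng(0)
d, r = 8, 2
h = rng.normal(size=d)
R = np.linalg.qr(rng.normal(size=(d, r)))[0].T  # r x d, orthonormal rows
W = rng.normal(size=(r, d))
b = rng.normal(size=r)
print(loreft(h, R, W, b).shape)  # (8,)
```

Because the edit lives entirely in the row space of R, the number of trained parameters scales with r and d rather than with the model size.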
May 09, 2023
Obtaining robust, human-interpretable explanations of large, general-purpose language models is an urgent goal for AI. Building on the theory of causal abstraction, we release a generic library that encapsulates Boundless DAS, introduced in our paper, for finding representations that play a given causal role in LLMs with billions of parameters.
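The basic operation behind DAS-style methods is the interchange intervention: swap the component of one input's representation along a learned subspace with the corresponding component from another input, and see whether the model's behavior changes accordingly. Below is a simplified sketch of that operation alone (not the library's API; names and shapes are assumptions).

```python
import numpy as np

def interchange(h_base, h_source, R):
    """DAS-style interchange intervention (simplified sketch):
    replace the component of h_base lying in the subspace spanned by
    the orthonormal rows of R with the corresponding component of
    h_source, leaving the orthogonal complement of h_base untouched.
    """
    return h_base + R.T @ (R @ h_source - R @ h_base)

rng = np.random.default_rng(1)
d, r = 8, 2
R = np.linalg.qr(rng.normal(size=(d, r)))[0].T  # r x d, orthonormal rows
h_base, h_source = rng.normal(size=d), rng.normal(size=d)
h_new = interchange(h_base, h_source, R)
print(h_new.shape)  # (8,)
```

In Boundless DAS the subspace (here a fixed R) is itself learned so that the interchanged component tracks a hypothesized causal variable.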
June 30, 2022
It took me years to transition from an aerospace engineering student to an NLP Ph.D. student. I want to share my experience as much as I can, so people can build on top of it and make their own experience even better. For my SOP, I have to credit my good friend Nelson F. Liu: I wrote my SOP based on his! I applied twice, and I am also happy to share the version from my failed attempt. One takeaway for me: you need a big vision that is grounded in specific past experience.