Zhengxuan Wu


my core dump about interpretability, language models, and other stuff.

August 02, 2025
I try to make sense of the world by reasoning things down to known theorems. The double-slit experiment is clearly an outlier for this kind of reasoning, and it bothered me for a while. With my limited knowledge of quantum physics, I will share some thoughts on how I perceive the world through the lens of physics.
August 01, 2025
Su Shi (苏轼, 1037–1101) is one of the great philosophers in Chinese history. At 20, he was ranked as the smartest person in the country, and his talent was recognized by the emperor, making him a young achiever. Later in life, however, he was banished; the delta between his early peak and his later years was simply too large. Yet it was during this period that he wrote his best poems, sharing his thoughts and philosophy. The piece I like most is '定风波', which reflects my mental state of 'zen' in a way. I will share my thoughts on this poem.
July 12, 2025
Representation steering is a powerful tool for understanding and controlling the behavior of language models. In this post, I will share lessons learned from our recent work on training a better representation steering method with a preference-based training objective.
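The basic idea behind representation steering can be sketched as adding a direction to a hidden state at inference time. A minimal illustration with made-up numbers; the direction `v` and scale `alpha` here are random placeholders, not the vectors our method actually trains:

```python
import numpy as np

def steer(hidden, v, alpha=1.0):
    """Shift a hidden-state vector along a (unit-norm) steering direction v."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
h = rng.standard_normal(8)   # a hypothetical hidden state
v = rng.standard_normal(8)   # a hypothetical steering direction
v /= np.linalg.norm(v)       # unit-normalize the direction

h_steered = steer(h, v, alpha=2.0)
# Since v is unit-norm, the steered state moves exactly alpha along v:
print(np.dot(h_steered - h, v))
```

A preference-based objective would learn `v` (and `alpha`) from pairs of preferred vs. dispreferred generations instead of fixing them by hand.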
April 05, 2024
Representation finetuning (ReFT) is a novel approach to parameter-efficient, powerful, and interpretable finetuning of language models. It draws inspiration from our interpretability work on distributed alignment search (DAS). Instead of training any model weights, we train interventions that edit representations on-the-fly. We demonstrate that editing a very limited number of representations is sufficient to achieve, or get close to, state-of-the-art (SoTA) performance across a wide range of tasks.
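The intervention idea can be sketched as a low-rank edit of a frozen hidden state, in the spirit of the LoReFT form h + R^T(Wh + b - Rh). This is a simplified numpy sketch with random parameters, not the actual trained implementation; dimensions and initializations are placeholders:

```python
import numpy as np

d, r = 16, 4                 # hidden size and (small) intervention rank
rng = np.random.default_rng(0)

# The intervention parameters are the only trainable pieces; the frozen
# model weights are never touched. R has orthonormal rows.
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = Q.T                      # shape (r, d), rows orthonormal
W = rng.standard_normal((r, d)) * 0.1
b = np.zeros(r)

def loreft(h):
    """Edit hidden state h inside an r-dimensional subspace:
    h + R^T (W h + b - R h)."""
    return h + R.T @ (W @ h + b - R @ h)

h = rng.standard_normal(d)
h_edit = loreft(h)
# Because R has orthonormal rows, the edit forces the projection of the
# hidden state onto R's subspace to equal W h + b:
print(np.allclose(R @ h_edit, W @ h + b))
```

Everything outside the r-dimensional subspace passes through unchanged, which is what makes the edit both parameter-efficient and easy to analyze.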
May 09, 2023
Obtaining robust, human-interpretable explanations of large, general-purpose language models is an urgent goal for AI. Building on the theory of causal abstraction, we release a generic library encapsulating Boundless DAS, introduced in our paper, for finding representations that play a given causal role in LLMs with billions of parameters.
June 30, 2022
It took me years to transition from an aerospace engineering student to an NLP Ph.D. student. I want to share my experience as much as I can, so people can build on top of it and make their own experience even better. For my SOP, I have to credit my good friend Nelson F. Liu: I wrote my SOP based on his! I applied twice, and I am also happy to share the version from my failed attempt. One takeaway for me: you need a big vision that is grounded in specific past experience.