Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens
In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis: even steps showed higher intermediate-answer detection and odd steps showed higher entropy, matching the results in “Can we interpret latent reasoning using current mechanistic interpretability tools?”.
This post investigates activation steering applied to latent reasoning and examines the resulting performance changes.
Quick Summary:
- The tuned logit lens sometimes does not find the final answer to a prompt, instead finding a close approximation.
- The tuned logit lens does not seem to place the final answer at a consistent layer or latent.
- Tuned logit lens variants, such as one trained only on latent 3, still show "therefore" only on odd latents.
- Activation steering with the average difference between latent vectors did not increase accuracy for specific latent pair combinations; instead it closely matched random-vector patching from “Can we interpret latent reasoning using current mechanistic interpretability tools?”.
- Steering the KV cache can increase CODI's accuracy, while steering the hidden states does not seem to have a significant effect.
Experimental setup
CODI model
I use the publicly available CODI Llama 3.2 1B checkpoint from “Can we interpret latent reasoning using current mechanistic interpretability tools?”.
Tuned Logit Lens
To create my tuned logit lens implementation I used the training code from “Eliciting Latent Predictions from Transformers with the Tuned Lens”.
Activation Steering
- Embedding steering
Compute the average hidden state at each latent position, and use the difference between the averages for latent vectors A and B to steer the hidden states.
Since CODI reads the KV values at the EoT token, getting new KV values that contain the steered information requires steering latent 1, running CODI for one additional latent, taking the KV values at latent 2, and then checking the output.
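The hidden-state steering step above can be sketched as follows. This is a toy illustration: the random arrays, the function names, and the `(n_prompts, n_latents, d_model)` layout are my assumptions standing in for the actual CODI activations, not the real code.

```python
import numpy as np

def steering_vector(acts, latent_a, latent_b):
    """Mean activation difference between two latent positions.

    acts: (n_prompts, n_latents, d_model) hidden states collected
    from CODI forward passes (assumed layout)."""
    return acts[:, latent_a].mean(axis=0) - acts[:, latent_b].mean(axis=0)

def steer(hidden, vec, coef):
    """Add the scaled difference vector to one hidden state."""
    return hidden + coef * vec

# Toy data: 8 prompts, 6 latents, d_model 16 (random stand-ins).
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 6, 16))
vec = steering_vector(acts, latent_a=1, latent_b=3)
steered = steer(acts[0, 1], vec, coef=0.5)
print(steered.shape)
```

In the real setup the steered vector would be written back into latent 1 before running the extra latent step described above.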
- KV cache Steering
Steering the KV cache and adding the steered cache directly onto the CODI model: the average difference in KV values is added directly to past_key_values.
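A minimal sketch of the KV-cache variant, assuming a simplified list-of-(key, value)-pairs cache rather than the actual HuggingFace `past_key_values` object; the shapes and names here are illustrative only.

```python
import numpy as np

def steer_kv_cache(past_key_values, kv_diff, coef):
    """Add a scaled mean KV difference onto each layer's cached keys
    and values. Both arguments are lists of (key, value) array pairs,
    one per layer, e.g. shape (n_heads, seq_len, head_dim)."""
    return [(k + coef * dk, v + coef * dv)
            for (k, v), (dk, dv) in zip(past_key_values, kv_diff)]

# Toy 2-layer cache and a random stand-in for mean KV(A) - mean KV(B).
rng = np.random.default_rng(0)
cache = [(rng.normal(size=(4, 5, 8)), rng.normal(size=(4, 5, 8)))
         for _ in range(2)]
diff = [(rng.normal(size=(4, 5, 8)), rng.normal(size=(4, 5, 8)))
        for _ in range(2)]
steered_cache = steer_kv_cache(cache, diff, coef=1.0)
print(len(steered_cache))
```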
Experiments
Confirming Previous Assumptions
PROMPT = "Out of 600 employees in a company, 30% got promoted while 10% received bonus. How many employees did not get either a promotion or a bonus?"
Answer = 360
Tuned Logit Lens properties:
- The tuned lens approximates but does not find the answer in some cases, e.g. 720 (360 × 2) and 350 (360 − 10) at latents 0 and 1
- These approximate answers are not GSM8K artifacts, as neither number is among the most common answers in the dataset
- The answers appearing at latents 3 and 5 in my previous post may be prompt-specific, which suggests the tuned lens is best used as a way to surface potential outputs rather than to locate the final answer
[Figures: “Default Tuned” and “Default” lens decodings]
The following is the answer frequency for the GSM8K data used to train the tuned logit lens
This prompted me to revisit my previous results using a tuned logit lens trained only on latent 3. Notably, 'therefore' still appears only on odd latents, even with this different prompt.
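For reference, the tuned-lens decoding step used throughout this section can be sketched as a per-layer learned affine translator followed by the unembedding matrix. The weights below are random placeholders, not the trained lens, and the function name is my own.

```python
import numpy as np

def tuned_lens_decode(hidden, w, b, unembed):
    """Tuned logit lens: apply the layer's learned affine translator
    to a latent hidden state, then project through the unembedding
    matrix; the argmax logit is the predicted token."""
    logits = (hidden @ w + b) @ unembed
    return int(np.argmax(logits))

# Placeholder weights: a real lens would use translators trained per
# layer as in the tuned lens paper, plus the model's own unembedding.
rng = np.random.default_rng(0)
d_model, vocab = 16, 100
w = rng.normal(size=(d_model, d_model))
b = rng.normal(size=d_model)
unembed = rng.normal(size=(d_model, vocab))
hidden = rng.normal(size=d_model)  # one CODI latent vector
top_token = tuned_lens_decode(hidden, w, b, unembed)
print(top_token)
```

With an identity translator and zero bias this reduces to the default logit lens, which is the baseline the tuned variant is compared against.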
Activation Difference (Steering Embeddings)
Across all coefficient values tested, steering was applied to latents 1–4, with one additional latent step run afterward to obtain updated KV values. The steered models consistently underperform the no-steering baseline until the later latents, where they match the performance of random-vector patching from “Can we interpret latent reasoning using current mechanistic interpretability tools?”. This may be because the average difference vector is too noisy to encode meaningful directional information, so steering with it behaves like random-vector patching.
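The sweep described above can be organized as a simple grid over latent pairs and coefficients. Here `evaluate` is a stand-in for running the steered CODI model on GSM8K and returning accuracy; the lambda below is a toy scoring function, not real results.

```python
import itertools

def sweep(pairs, coefs, evaluate):
    """Grid over (latent A, latent B) steering pairs and coefficients.
    evaluate(a, b, coef) is assumed to run the steered model and return
    accuracy; any callable with that signature works here."""
    return {(a, b, c): evaluate(a, b, c)
            for (a, b), c in itertools.product(pairs, coefs)}

# Toy evaluate: fabricated numbers purely to show the sweep structure.
toy = sweep(pairs=[(1, 2), (1, 4)], coefs=[0.5, 1.0],
            evaluate=lambda a, b, c: round(0.4 - 0.01 * c * (b - a), 3))
print(len(toy))
```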
Activation Difference (Steer KV cache)
- Unlike the hidden-state method, which required another CODI pass to obtain new KV values, this method steers the KV values as they are being used at the EoT token to generate the answer
- The setup takes the mean activations of latents A and B, subtracts them, and scales the difference by a coefficient; latent A is the earlier latent vector, from which a later latent vector B is subtracted
- Unlike steering the hidden states, steering the KV values did change the accuracy at latent step 5
- Most steering vectors performed worse than random latent-vector activation patching, but some performed significantly better than the baseline
- Coefficient (0.5):
- The steered vectors that improved performance are A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, and A4-B5, also at coefficient 1. When steering with the difference between an earlier and a later latent vector, it is notable that the combinations with latent 2 as latent A performed best.
- Coefficient (-1):
- A coefficient of -1 flips A-B to B-A, so A1-B4, A1-B6, A4-B6, A5-B6 can be read as B4-A1, B6-A1, B6-A4, B6-A5. Steering with latent 6 minus an earlier latent (1, 4, 5) seems to give a significant increase in accuracy, with the differences between latents 1 and 6 and between latents 5 and 6 showing the largest gains.
- Accuracy for all steering vectors decreases as the coefficient magnitude increases
- No activation difference improves accuracy at both positive and negative coefficients
[Chart legends — positive coefficients: A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, A4-B5, Baseline; negative coefficients: B4-A1, B6-A1, B6-A4, B6-A5, Baseline]
For negative coefficients, A1-B4, A1-B6, A4-B6, and A5-B6 performed better than the baseline after steering; a common pattern is that negative-coefficient steering performed significantly better than the baseline at latent 5.
For positive coefficients, A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, and A4-B5 performed better than the baseline.
Activation Difference (Logit Lens)
No clear pattern emerges from the activation-difference logit lens. The first image shows the default logit lens and the second the tuned logit lens; the y-axis is latent A, the x-axis is latent B, the activation difference is A − B, and the logit lens was applied to the difference of the mean activations of A and B across the model's layers.
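The heatmap described here can be reproduced in sketch form: project each mean-activation difference A − B through the unembedding and record the top token, with latent A on the y-axis and latent B on the x-axis. The random arrays below stand in for CODI's actual activations and unembedding.

```python
import numpy as np

def diff_lens_grid(mean_acts, unembed):
    """Default logit lens on activation differences: for every latent
    pair (A, B), project mean_acts[A] - mean_acts[B] through the
    unembedding matrix and record the top token id."""
    n = mean_acts.shape[0]
    grid = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in range(n):
            grid[a, b] = np.argmax((mean_acts[a] - mean_acts[b]) @ unembed)
    return grid

# Toy stand-ins: 6 latents, d_model 16, vocab 100.
rng = np.random.default_rng(0)
mean_acts = rng.normal(size=(6, 16))
unembed = rng.normal(size=(16, 100))
grid = diff_lens_grid(mean_acts, unembed)
print(grid.shape)
```

On the diagonal the difference is zero, so the decoded token is trivial; only the off-diagonal cells carry information, which matches why the heatmap is read as an A-versus-B grid.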
Future Work
- Find a setup that makes activation steering work with CODI
- Apply the thought-anchors analysis to CODI
- Investigate why certain activation differences for the KV cache increased accuracy
- Use other methods such as PCA to investigate why activation steering worked on the KV cache but not on the hidden states