Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens
In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis: even steps showed higher intermediate-answer detection and odd steps showed higher entropy, matching the results in “Can we interpret latent reasoning using current mechanistic interpretability tools?”.
This post investigates activation steering applied to latent reasoning and examines the resulting performance changes.
Quick Summary:
- The tuned logit lens sometimes does not find the final answer to a prompt, instead finding a close approximation.
- The tuned logit lens does not seem to place the final answer at a consistent layer or latent.
- Tuned logit lens variants, such as one trained only on latent 3, still show "therefore" only on odd latents.
- Activation steering with the average difference between latent vectors did not increase accuracy for specific latent pair combinations; instead it closely matched random-vector patching from “Can we interpret latent reasoning using current mechanistic interpretability tools?”.
- Steering the KV cache can increase CODI's accuracy, while steering the hidden states does not seem to have a significant effect.
Experimental setup
CODI model
I use the publicly available CODI Llama 3.2 1B checkpoint from “Can we interpret latent reasoning using current mechanistic interpretability tools?”.
Tuned Logit Lens
To create my tuned logit lens implementation I used the training code from “Eliciting Latent Predictions from Transformers with the Tuned Lens”.
Activation Steering
- Embedding steering
Compute the average hidden state at each latent position, and use the difference between the averages for latent vectors A and B to steer the hidden states.
Since CODI reads the KV values at the EoT token, getting new KV values that contain the steered information requires steering latent 1, running CODI for one additional latent, taking the KV values at latent 2, and then checking the output.
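The hidden-state steering step above can be sketched as follows. This is a toy illustration: the random arrays, the function names, and the `(n_prompts, n_latents, d_model)` layout are my assumptions standing in for the actual CODI activations, not the real code.

```python
import numpy as np

def steering_vector(acts, latent_a, latent_b):
    """Mean activation difference between two latent positions.

    acts: (n_prompts, n_latents, d_model) hidden states collected
    from CODI forward passes (assumed layout)."""
    return acts[:, latent_a].mean(axis=0) - acts[:, latent_b].mean(axis=0)

def steer(hidden, vec, coef):
    """Add the scaled difference vector to one hidden state."""
    return hidden + coef * vec

# Toy data: 8 prompts, 6 latents, d_model 16 (random stand-ins).
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 6, 16))
vec = steering_vector(acts, latent_a=1, latent_b=3)
steered = steer(acts[0, 1], vec, coef=0.5)
print(steered.shape)
```

In the real setup the steered vector would be written back into latent 1 before running the extra latent step described above.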
- KV cache Steering
Steering the KV cache and adding the steered cache directly onto the CODI model: the average difference in KV values is added directly to past_key_values.
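A minimal sketch of the KV-cache variant, assuming a simplified list-of-(key, value)-pairs cache rather than the actual HuggingFace `past_key_values` object; the shapes and names here are illustrative only.

```python
import numpy as np

def steer_kv_cache(past_key_values, kv_diff, coef):
    """Add a scaled mean KV difference onto each layer's cached keys
    and values. Both arguments are lists of (key, value) array pairs,
    one per layer, e.g. shape (n_heads, seq_len, head_dim)."""
    return [(k + coef * dk, v + coef * dv)
            for (k, v), (dk, dv) in zip(past_key_values, kv_diff)]

# Toy 2-layer cache and a random stand-in for mean KV(A) - mean KV(B).
rng = np.random.default_rng(0)
cache = [(rng.normal(size=(4, 5, 8)), rng.normal(size=(4, 5, 8)))
         for _ in range(2)]
diff = [(rng.normal(size=(4, 5, 8)), rng.normal(size=(4, 5, 8)))
        for _ in range(2)]
steered_cache = steer_kv_cache(cache, diff, coef=1.0)
print(len(steered_cache))
```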
Experiments
Confirming Previous Assumptions
PROMPT = "Out of 600 employees in a company, 30% got promoted while 10% received bonus. How many employees did not get either a promotion or a bonus?"
Answer = 360
Tuned Logit Lens properties:
- The tuned lens approximates but does not find the answer in some cases, e.g. 720 (360 × 2) and 350 (360 − 10) at latents 0 and 1
- These approximate answers are not GSM8K artifacts, as neither number is among the most common answers in the dataset
- The answers appearing at latents 3 and 5 in my previous post may be prompt-specific, which suggests the tuned lens is best used as a way to surface potential outputs rather than to locate the final answer
[Figures: “Default Tuned” and “Default” lens decodings]
The following is the answer frequency for the GSM8K data used to train the tuned logit lens
This prompted me to revisit my previous results using a tuned logit lens trained only on latent 3. Notably, 'therefore' still appears only on odd latents, even with this different prompt.
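For reference, the tuned-lens decoding step used throughout this section can be sketched as a per-layer learned affine translator followed by the unembedding matrix. The weights below are random placeholders, not the trained lens, and the function name is my own.

```python
import numpy as np

def tuned_lens_decode(hidden, w, b, unembed):
    """Tuned logit lens: apply the layer's learned affine translator
    to a latent hidden state, then project through the unembedding
    matrix; the argmax logit is the predicted token."""
    logits = (hidden @ w + b) @ unembed
    return int(np.argmax(logits))

# Placeholder weights: a real lens would use translators trained per
# layer as in the tuned lens paper, plus the model's own unembedding.
rng = np.random.default_rng(0)
d_model, vocab = 16, 100
w = rng.normal(size=(d_model, d_model))
b = rng.normal(size=d_model)
unembed = rng.normal(size=(d_model, vocab))
hidden = rng.normal(size=d_model)  # one CODI latent vector
top_token = tuned_lens_decode(hidden, w, b, unembed)
print(top_token)
```

With an identity translator and zero bias this reduces to the default logit lens, which is the baseline the tuned variant is compared against.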
Activation Difference (Steering Embeddings)
Across all coefficient values tested, steering was applied to latents 1–4, with one additional latent step run afterward to obtain updated KV values. The steered models consistently underperform the no-steering baseline until the later latents, where they match the performance of random-vector patching from “Can we interpret latent reasoning using current mechanistic interpretability tools?”. This may be because the average difference vector is too noisy to encode meaningful directional information, so steering with it behaves like random-vector patching.
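The sweep described above can be organized as a simple grid over latent pairs and coefficients. Here `evaluate` is a stand-in for running the steered CODI model on GSM8K and returning accuracy; the lambda below is a toy scoring function, not real results.

```python
import itertools

def sweep(pairs, coefs, evaluate):
    """Grid over (latent A, latent B) steering pairs and coefficients.
    evaluate(a, b, coef) is assumed to run the steered model and return
    accuracy; any callable with that signature works here."""
    return {(a, b, c): evaluate(a, b, c)
            for (a, b), c in itertools.product(pairs, coefs)}

# Toy evaluate: fabricated numbers purely to show the sweep structure.
toy = sweep(pairs=[(1, 2), (1, 4)], coefs=[0.5, 1.0],
            evaluate=lambda a, b, c: round(0.4 - 0.01 * c * (b - a), 3))
print(len(toy))
```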
Activation Difference (Steer KV cache)
- Unlike the hidden-state method, which required another CODI pass to obtain new KV values, this method steers the KV values as they are being used at the EoT token to generate the answer
- The setup takes the mean activations of latents A and B, subtracts them, and scales the difference by a coefficient; latent A is the earlier latent vector, from which a later latent vector B is subtracted
- Unlike steering the hidden states, steering the KV values did change the accuracy at latent step 5
- Most steering vectors performed worse than random latent-vector activation patching, but some performed significantly better than the baseline
- Coefficient (0.5):
- The steered vectors that improved performance are A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, and A4-B5, also at coefficient 1. When steering with the difference between an earlier and a later latent vector, it is notable that the combinations with latent 2 as latent A performed best.
- Coefficient (-1):
- A coefficient of -1 flips A-B to B-A, so A1-B4, A1-B6, A4-B6, A5-B6 can be read as B4-A1, B6-A1, B6-A4, B6-A5. Steering with latent 6 minus an earlier latent (1, 4, 5) seems to give a significant increase in accuracy, with the differences between latents 1 and 6 and between latents 5 and 6 showing the largest gains.
- Accuracy for all steering vectors decreases as the coefficient magnitude increases
- No activation difference improves accuracy at both positive and negative coefficients
[Chart legends — positive coefficients: A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, A4-B5, Baseline; negative coefficients: B4-A1, B6-A1, B6-A4, B6-A5, Baseline]
For negative coefficients, A1-B4, A1-B6, A4-B6, and A5-B6 performed better than the baseline after steering; a common pattern is that negative-coefficient steering performed significantly better than the baseline at latent 5.
For positive coefficients, A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, and A4-B5 performed better than the baseline.
Activation Difference (Logit Lens)
No clear pattern emerges from the activation-difference logit lens. The first image shows the default logit lens and the second the tuned logit lens; the y-axis is latent A, the x-axis is latent B, the activation difference is A − B, and the logit lens was applied to the difference of the mean activations of A and B across the model's layers.
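The heatmap described here can be reproduced in sketch form: project each mean-activation difference A − B through the unembedding and record the top token, with latent A on the y-axis and latent B on the x-axis. The random arrays below stand in for CODI's actual activations and unembedding.

```python
import numpy as np

def diff_lens_grid(mean_acts, unembed):
    """Default logit lens on activation differences: for every latent
    pair (A, B), project mean_acts[A] - mean_acts[B] through the
    unembedding matrix and record the top token id."""
    n = mean_acts.shape[0]
    grid = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in range(n):
            grid[a, b] = np.argmax((mean_acts[a] - mean_acts[b]) @ unembed)
    return grid

# Toy stand-ins: 6 latents, d_model 16, vocab 100.
rng = np.random.default_rng(0)
mean_acts = rng.normal(size=(6, 16))
unembed = rng.normal(size=(16, 100))
grid = diff_lens_grid(mean_acts, unembed)
print(grid.shape)
```

On the diagonal the difference is zero, so the decoded token is trivial; only the off-diagonal cells carry information, which matches why the heatmap is read as an A-versus-B grid.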
Future Work
- Find a setup that makes activation steering work with CODI
- Apply the thought-anchors analysis to CODI
- Investigate why certain activation differences for the KV cache increased accuracy
- Use other methods such as PCA to investigate why activation steering worked on the KV cache but not on the hidden states