
Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens

LessWrong · by Realmbird · April 4, 2026 · 6 min read


In my previous post I found evidence consistent with the scratchpad paper's compute/store alternation hypothesis — even steps showing higher intermediate answer detection and odd steps showing higher entropy — along with results matching “Can we interpret latent reasoning using current mechanistic interpretability tools?”.

This post investigates activation steering applied to latent reasoning and examines the resulting performance changes.

Quick Summary:

  • Tuned logit lens sometimes does not find the final answer to a prompt and instead finds a close approximation.
  • Tuned logit lens does not seem to have a consistent location (layer or latent) where the final answer is positioned.
  • Tuned logit lens variants, like ones trained only on latent 3, still show "therefore" only on odd latents.
  • Activation steering with the average difference between latent vectors did not increase accuracy for specific latent pair combinations, and instead closely matched random vectors from “Can we interpret latent reasoning using current mechanistic interpretability tools?”.
  • Steering the KV cache to steer CODI outputs can increase accuracy, while steering the hidden states does not seem to have a significant effect on CODI.

Experimental setup

CODI model

I use the publicly available CODI Llama 3.2 1B checkpoint from “Can we interpret latent reasoning using current mechanistic interpretability tools?”.

Tuned Logit Lens

To create my tuned logit lens implementation I used the training code for the tuned logit lens from “Eliciting Latent Predictions from Transformers with the Tuned Lens”.

Activation Steering

  • Embedding steering

Getting the average hidden state for each latent vector and using the difference between latent vectors A and B to steer the hidden states.

Since CODI uses the KV values at the EoT token, getting new KV values that contain the information from the steered vector required steering latent 1, running CODI for one additional latent, and then taking the KV values of latent 2 and checking the output.
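The hidden-state steering step can be sketched as follows. This is a minimal stand-in, not CODI's actual code: the shapes, the latent pair (A, B), and the coefficient are illustrative assumptions, and the random tensors stand in for real hidden states collected from the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: hidden states for 6 latent steps across a batch of prompts.
# Shape: (n_prompts, n_latents, d_model) — real values would come from CODI.
hidden = rng.normal(size=(16, 6, 8))

# Mean hidden state at each latent position, averaged over prompts.
mean_per_latent = hidden.mean(axis=0)           # (n_latents, d_model)

# Steering vector: difference between the latent A and latent B means.
A, B = 1, 4                                     # hypothetical pair, 0-indexed
delta = mean_per_latent[A] - mean_per_latent[B]

def steer(h, coefficient):
    """Add the scaled average-difference vector to a hidden state."""
    return h + coefficient * delta

steered = steer(hidden[0, A], coefficient=0.5)

# A coefficient of -1 flips A-B into B-A:
assert np.allclose(steer(hidden[0, A], -1.0),
                   hidden[0, A] + (mean_per_latent[B] - mean_per_latent[A]))
```

In the real setup the steered hidden state would be fed back into CODI for one additional latent step to produce the updated KV values described above.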

  • KV cache Steering

Steering the KV cache and applying the steered KV cache directly to the CODI model: the average difference in KV values is added directly to past_key_values.
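A minimal sketch of the KV cache steering, assuming a HuggingFace-style cache layout of one (key, value) pair per layer. The per-layer deltas here are random stand-ins for the precomputed average KV differences between two latents:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy past_key_values: one (key, value) pair per layer,
# each of shape (batch, n_heads, seq_len, head_dim).
n_layers, shape = 2, (1, 4, 6, 8)
past_kv = [(rng.normal(size=shape), rng.normal(size=shape))
           for _ in range(n_layers)]

# Hypothetical per-layer average KV differences between latents A and B
# (random stand-ins; the real ones would be averaged over prompts).
kv_delta = [(rng.normal(size=shape), rng.normal(size=shape))
            for _ in range(n_layers)]

def steer_kv(past_kv, kv_delta, coefficient):
    """Add the scaled average KV difference directly onto the cache."""
    return [
        (k + coefficient * dk, v + coefficient * dv)
        for (k, v), (dk, dv) in zip(past_kv, kv_delta)
    ]

steered_kv = steer_kv(past_kv, kv_delta, coefficient=0.5)
```

The steered cache would then be passed back to the model as its past_key_values when generating the answer at the EoT token.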

Experiments

Confirming Previous Assumptions

PROMPT = "Out of 600 employees in a company, 30% got promoted while 10% received bonus. How many employees did not get either a promotion or a bonus?"

Answer = 360
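The expected answer can be checked with a quick calculation, assuming the promoted and bonus groups do not overlap:

```python
employees = 600
promoted = int(0.30 * employees)   # 30% promoted
bonus = int(0.10 * employees)      # 10% received a bonus
# Assuming no employee is in both groups:
neither = employees - promoted - bonus
print(neither)  # → 360
```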

Tuned Logit Lens properties:

  • Tuned lens approximates but doesn't find the answer in some cases, e.g. 720 (360 × 2) and 350 (360 − 10) at latents 0 and 1.
  • These approximate answers are not GSM8K artifacts, as neither of these numbers is among the most common answers in the dataset.
  • The answers being found in latents 3 and 5 with the tuned lens in my previous post might be prompt specific. This suggests the tuned lens might only be usable as a way to see potential outputs.

[Figure: tuned logit lens outputs (default tuned)]

[Figure: default logit lens outputs]

The following is the answer frequency for the GSM8K data used to train the tuned logit lens:

[Figure: GSM8K answer frequency]

This prompted me to revisit my previous results using a tuned logit lens trained only on latent 3. Notably, 'therefore' still appears only on odd latents, even with this different prompt.
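The tuned logit lens used here can be sketched as a learned affine translator per layer, pushed through the model's frozen unembedding. The weights below are random stand-ins (in practice the translator is trained to match the model's final logits), and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab = 8, 50

# Frozen unembedding matrix of the model (random stand-in).
W_U = rng.normal(size=(d_model, vocab))

# The tuned lens learns one affine map (W_l, b_l) per layer so that the
# translated hidden state decodes to the model's final-layer logits.
W_l = rng.normal(size=(d_model, d_model))
b_l = rng.normal(size=(d_model,))

def tuned_lens(h):
    """Translate an intermediate hidden state, then decode to logits."""
    return (h @ W_l + b_l) @ W_U

h = rng.normal(size=(d_model,))   # a hidden state at some latent step
logits = tuned_lens(h)
top_token = int(np.argmax(logits))  # the lens's predicted token id
```

Training the translator only on latent 3, as in the variant above, just restricts which hidden states (h, target) pairs the affine map is fit on.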

Activation Difference (Steering Embeddings)

Across all coefficient values tested, the steering was applied to latents 1–4, with one additional latent step run afterward to obtain updated KV values. The steered models consistently underperform the no-steering baseline until the later latents, where performance matches random vector patching from “Can we interpret latent reasoning using current mechanistic interpretability tools?”. This might be because the steering acts the same as random vector patching: the average difference vector may be too noisy to encode meaningful directional information.

Activation Difference (Steer KV cache)

  • Unlike the other steering method, which required another CODI pass to get new KV values, this method steered the KV values as they were being used on the EoT token to generate the answer.
  • The setup takes the mean activations of latents A and B, subtracts them, and steers with the difference scaled by a coefficient. Latent A is the earlier latent vector, from which a later latent vector B is subtracted.
  • Unlike steering the hidden states, steering the KV values did change the accuracy at latent step 5.
  • Most vectors used for steering performed worse than random latent vector activation patching; some performed significantly better than the baseline.
  • Coefficient (0.5): the steered vectors that improved performance are A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, and A4-B5 at coefficient 1. When steering with the difference of an earlier latent vector and a later latent vector, it is interesting that combinations with latent 2 as latent A performed best.
  • Coefficient (-1): the negative coefficient flips A-B to B-A, so A1-B4, A1-B6, A4-B6, and A5-B6 can be read as B4-A1, B6-A1, B6-A4, and B6-A5. Steering with latent 6 minus an earlier latent (1, 4, or 5) seems to give a significant increase in accuracy, and the differences between latents 1 and 6 and between latents 5 and 6 showed the highest increase.
  • Accuracy for all steering decreases as the coefficient increases.
  • No activation difference improves accuracy for both positive and negative coefficients.

[Figures: accuracy after steering vs. the no-steering baseline]

Positive coefficients: A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, A4-B5

Negative coefficients: B4-A1, B6-A1, B6-A4, B6-A5

For negative coefficients, A1-B4, A1-B6, A4-B6, and A5-B6 performed better than the baseline. A common pattern is that, with a negative coefficient, performance after steering at latent 5 was significantly better than the baseline.

With positive coefficients, A1-B2, A1-B5, A2-B3, A2-B4, A3-B5, and A4-B5 performed better than the baselines.

Activation Difference (Logit Lens)

No clear pattern emerges from the activation difference logit lens. The first image shows the default logit lens and the second the tuned logit lens; the y-axis is latent A and the x-axis is latent B. The activation difference is the vector A − B, and the logit lens was applied to the difference of the mean activations of A and B at the different layers of the model.
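The heatmaps above can be sketched as follows: for each (A, B) pair, decode the mean-activation difference with a default logit lens and record the top token. The unembedding matrix and activations here are random stand-ins, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, vocab, n_latents = 8, 50, 6

W_U = rng.normal(size=(d_model, vocab))            # unembedding (stand-in)
mean_acts = rng.normal(size=(n_latents, d_model))  # mean activation per latent

# One heatmap cell per (A, B) pair: the default logit lens simply
# decodes the difference vector through the unembedding.
top_tokens = np.zeros((n_latents, n_latents), dtype=int)
for a in range(n_latents):
    for b in range(n_latents):
        diff = mean_acts[a] - mean_acts[b]
        top_tokens[a, b] = np.argmax(diff @ W_U)
```

The tuned variant would apply the layer's learned affine translator to the difference before the unembedding.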

Future Work

  • Find a setup that makes activation steering work with CODI
  • Complete the thought anchors work with CODI
  • Investigate why certain activation differences for the KV cache increased accuracy
  • Use other methods such as PCA to observe why activation steering worked on the KV cache but not on the hidden states