Meta's new structured prompting technique makes LLMs significantly better at code review — boosting accuracy to 93% in some cases
Deploying AI agents for repository-scale tasks like bug detection, patch verification, and code review requires overcoming significant technical hurdles. One major bottleneck: the need to set up dynamic execution sandboxes for every repository, which are expensive and computationally heavy.
Using large language model (LLM) reasoning instead of executing the code is rising in popularity to bypass this overhead, yet it frequently leads to unsupported guesses and hallucinations.
To improve execution-free reasoning, researchers at Meta introduce "semi-formal reasoning," a structured prompting technique. This method requires the AI agent to fill out a logical certificate by explicitly stating premises, tracing concrete execution paths, and deriving formal conclusions before providing an answer.
The structured format forces the agent to systematically gather evidence and follow function calls before drawing conclusions. This increases the accuracy of LLMs in coding tasks and significantly reduces errors in fault localization and codebase question-answering.
For developers using LLMs in code review tasks, semi-formal reasoning enables highly reliable, execution-free semantic code analysis while drastically reducing the infrastructure costs of AI coding systems.
Agentic code reasoning
Agentic code reasoning is an AI agent's ability to navigate files, trace dependencies, and iteratively gather context to perform deep semantic analysis on a codebase without running the code. In enterprise AI applications, this capability is essential for scaling automated bug detection, comprehensive code reviews, and patch verification across complex repositories where relevant context spans multiple files.
The industry currently tackles execution-free code verification through two primary approaches. The first involves unstructured LLM evaluators that try to verify code either directly or by training specialized LLMs as reward models to approximate test outcomes. The major drawback is their reliance on unstructured reasoning, which allows models to make confident claims about code behavior without explicit justification. Without structured constraints, it is difficult to ensure agents reason thoroughly rather than guess based on superficial patterns like function names.
The second approach involves formal verification, which translates code or reasoning into formal mathematical languages like Lean, Coq, or Datalog to enable automated proof checking. While rigorous, formal methods require defining the semantics of the programming language. This is entirely impractical for arbitrary enterprise codebases that span multiple frameworks and languages.
Existing approaches also tend to be highly fragmented and task-specific, often requiring entirely separate architectures or specialized training for each new problem domain. They lack the flexibility needed for broad, multi-purpose enterprise applications.
How semi-formal reasoning works
To bridge the gap between unstructured guessing and overly rigid mathematical proofs, the Meta researchers propose a structured prompting methodology, which they call “semi-formal reasoning.” This approach equips LLM agents with task-specific, structured reasoning templates.
Example template of semi-formal reasoning (source: arXiv)
These templates function as mandatory logical certificates. To complete a task, the agent must explicitly state premises, trace execution paths for specific tests, and derive a formal conclusion based solely on verifiable evidence.
The template forces the agent to gather proof from the codebase before making a judgment. The agent must actually follow function calls and data flows step-by-step rather than guessing their behavior based on surface-level naming conventions. This systematic evidence gathering helps the agent handle edge cases, such as confusing function names, and avoid making unsupported claims.
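As a concrete illustration, a certificate-style prompt following this pattern might look like the sketch below. The field names and wording here are our own illustrative assumptions, not the authors' released templates:

```python
# Hypothetical semi-formal reasoning template. Section names (PREMISES,
# EXECUTION TRACE, CONCLUSION) are illustrative, not the paper's exact wording.
SEMI_FORMAL_TEMPLATE = """You are verifying code without executing it.
Complete every section of this certificate before giving an answer.

PREMISES: Facts you verified by reading the code, each with file and line evidence.
EXECUTION TRACE: A step-by-step trace of the relevant call path for a concrete input.
CONCLUSION: A judgment that follows only from the premises and trace above.

TASK: {task}
"""

def build_prompt(task: str) -> str:
    """Wrap a code-reasoning task in the structured certificate."""
    return SEMI_FORMAL_TEMPLATE.format(task=task)
```

Because every section is mandatory, the model cannot jump straight to a verdict: a conclusion with no premises or trace is a visibly incomplete certificate.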
Semi-formal reasoning in action
The researchers evaluated semi-formal reasoning across three software engineering tasks: patch equivalence verification to determine if two patches yield identical test outcomes without running them, fault localization to pinpoint the exact lines of code causing a bug, and code question answering to test nuanced semantic understanding of complex codebases. The experiments used the Claude Opus-4.5 and Sonnet-4.5 models acting as autonomous verifier agents.
The team compared their structured semi-formal approach against several baselines, including standard reasoning, where an agentic model is given a minimal prompt and allowed to explain its thinking freely in unstructured natural language. They also compared against traditional text-similarity algorithms like difflib.
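A difflib-style baseline is easy to reproduce, which also makes its weakness easy to see. The sketch below (threshold value is an arbitrary assumption, not the paper's setup) scores two patches purely by textual similarity:

```python
import difflib

def patch_similarity(patch_a: str, patch_b: str) -> float:
    """Textual similarity between two patches, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, patch_a, patch_b).ratio()

def difflib_equivalent(patch_a: str, patch_b: str, threshold: float = 0.8) -> bool:
    """Naive baseline: call two patches 'equivalent' if their text is similar."""
    return patch_similarity(patch_a, patch_b) >= threshold
```

The failure mode is immediate: `"return x + x"` and `"return 2 * x"` are semantically identical but score well below 1.0, while two textually near-identical patches can behave completely differently. That is exactly the gap semantic reasoning is meant to close.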
Semi-formal reasoning shows significant improvement over standard reasoning (source: arXiv)
In patch equivalence, semi-formal reasoning improved accuracy on challenging, curated examples from 78% using standard reasoning to 88%. When evaluating real-world, agent-generated patches with test specifications available, the Opus-4.5 model using semi-formal reasoning achieved 93% verification accuracy, outperforming both the unstructured single-shot baseline at 86% and the difflib baseline at 73%. Other tasks showed similar gains across the board.
The paper highlights the value of semi-formal reasoning through real-world examples. In one case, the agent evaluates two patches in the Python Django repository that attempt to fix a bug with 2-digit year formatting for years before 1000 CE. One of the patches calls format(), which within the library resolves to a custom function that shadows Python's standard one.
Standard reasoning models look at these patches, assume format() refers to Python's standard built-in function, calculate that both approaches will yield the same string output, and incorrectly declare the patches equivalent.
Example of semi-formal reasoning vs standard reasoning (source: arXiv)
With semi-formal reasoning, the agent traces the execution path and checks method definitions. Following the structured template, the agent discovers that within one of the library’s files, the format() name is actually shadowed by a custom, module-level function. The agent formally proves that given the attributes of the input passed to the code, this patch will crash the system while the other will succeed.
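The pitfall generalizes beyond Django. Below is a minimal, self-contained reconstruction of the failure mode (our own illustration, not the actual Django code): a module-level function shadows the built-in format(), so a reader who assumes built-in semantics mispredicts the behavior.

```python
# Illustrative reconstruction only -- not the real Django code.
# This module-level format() shadows Python's built-in format(),
# and unlike the builtin it rejects years below 1000.
def format(value, spec=""):
    if value < 1000:
        raise ValueError("unsupported year")
    return str(value)[-2:]

def two_digit_year(year):
    # Looks like a call to the builtin, but name resolution finds the
    # module-level format() above -- exactly the trap in the paper's example.
    return format(year)
```

A reasoner that never checks where format() is defined concludes that two_digit_year(999) returns "99"; tracing the definition shows it actually raises.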
Based on their experiments, the researchers suggest that “LLM agents can perform meaningful semantic code analysis without execution, potentially reducing verification costs in RL training pipelines by avoiding expensive sandbox execution.”
Caveats and tradeoffs
While semi-formal reasoning offers substantial reliability improvements, enterprise developers must consider several practical caveats before adopting it. There is a clear compute and latency tradeoff. Semi-formal reasoning requires more API calls and tokens. In patch equivalence evaluations, semi-formal reasoning required roughly 2.8 times as many execution steps as standard unstructured reasoning.
The technique also does not universally improve performance, particularly if a model is already highly proficient at a specific task. When researchers evaluated the Sonnet-4.5 model on a code question-answering benchmark, standard unstructured reasoning already achieved a high accuracy of around 85%. Applying the semi-formal template in this scenario yielded no additional gains.
Furthermore, structured reasoning can produce highly confident wrong answers. Because the agent is forced to build elaborate, formal proof chains, it can become overly assured if its investigation is deep but incomplete. In one Python evaluation, the agent meticulously traced five different functions to uncover a valid edge case, but completely missed that a downstream piece of code already safely handled that exact scenario. Because it had built a strong evidence chain, it delivered an incorrect conclusion with extremely high confidence.
The system's reliance on concrete evidence also breaks down when it hits the boundaries of a codebase. When analyzing third-party libraries where the underlying source code is unavailable, the agent will still resort to guessing behavior based on function names.
And in some cases, despite strict prompt instructions, models will occasionally fail to fully trace concrete execution paths.
Ultimately, while semi-formal reasoning drastically reduces unstructured guessing and hallucinations, it does not completely eliminate them.
What developers should take away
This technique can be used out-of-the-box, requiring no model training or special packaging. It is execution-free, which means you do not need to add sandboxes or extra tools to your LLM environment. The tradeoff is that you pay more compute at inference time in exchange for higher accuracy on code review tasks.
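In practice, adoption can be as simple as wrapping your existing prompt in a certificate template before sending it to whatever model client you already use. A minimal sketch, assuming a generic `call_llm` function and an illustrative template of our own (not the authors' released wording):

```python
# Hypothetical integration sketch. `call_llm` stands in for any chat-completion
# client; the certificate text below is an illustrative assumption.
CERTIFICATE = """Before answering, complete this certificate:
PREMISES: evidence gathered from the codebase, with file and line references.
EXECUTION TRACE: a concrete step-by-step trace of the relevant code path.
CONCLUSION: a verdict that follows only from the sections above.

TASK: {task}
"""

def review_with_certificate(call_llm, task: str) -> str:
    """Send a code-review task wrapped in a semi-formal certificate."""
    return call_llm(CERTIFICATE.format(task=task))
```

Since the template is plain text, it drops into any provider's API without tool definitions or sandbox setup.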
The researchers suggest that structured agentic reasoning may offer “a flexible alternative to classical static analysis tools: rather than encoding analysis logic in specialized algorithms, we can prompt LLM agents with task-specific reasoning templates that generalize across languages and frameworks."
The researchers have made the prompt templates available, so you can readily adopt them in your own applications. While there is a lot of talk about prompt engineering being dead, this technique shows how much performance you can still squeeze out of well-structured prompts.
VentureBeat AI
https://venturebeat.com/orchestration/metas-new-structured-prompting-technique-makes-llms-significantly-better-at
