Same Prompt. Different Answers Every Time. Here's How I Fixed It.
This is Part 3 of our AI verification series. Part 1: Three AIs analyzed our product. None passed the truth filter → Part 2: Human in the loop doesn't scale. Human at the edge does. →
Same prompt. Same AI. Different sessions. Different outputs.
Part 1 showed three different AIs diverging on the same question.
That's expected. Different training, different weights, different answers.
But we didn't stop there. We re-ran the same AI on the same prompt in a new session.
We got materially different outputs again.
Both looked authoritative. Neither warned us they disagreed with each other.
What the same AI said twice
Prompt: "Forecast Korea's AI industry in 2027."
Session 1 produced:
- Market size: $10–15B at >25% CAGR
- Global positioning: "Global AI G3 powerhouse"
- Hardware claim: "All Korean electronics AI-native by 2027" — sourced to a single company's roadmap
Session 2 produced:
- Market size: KRW 4.46T (~$3.3B) at 14.3% CAGR
- Global positioning: "Top three AI powers" — framed as government target
- No hardware claim at all
Same prompt. Same AI. Different session. A 4× market size gap. No flags from either run.
This isn't a hallucination. Both outputs were internally coherent. Both read like credible analyst reports. The problem is deeper than hallucination.
Why this happens: AI inference is non-deterministic
We spent months trying to fix output drift with better prompts, more context, stricter instructions.
It didn't work.
Because the issue isn't the prompt.
AI is optimized to sound right. Not to prove itself.
What we call "hallucination" is mostly context drift — the model's plausibility engine filling gaps differently depending on what's salient in a given session. Different day, different sampling, different emphasis in the context window — different output. Same confidence posture throughout.
You can't prompt your way out of a non-deterministic system. You need verification as a separate step.
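A toy sketch of why this happens. This is not the real inference stack: the weight table below is a stand-in for a model's next-token distribution, and the token names are invented for illustration. The point it demonstrates is only that sampling from the same distribution with fresh random state in each "session" can take different paths at the same confidence level.

```python
import random

def sample_tokens(weights, rng, steps=5):
    """Temperature-style sampling: pick each token in proportion to its weight."""
    tokens = list(weights)
    probs = [weights[t] for t in tokens]
    return [rng.choices(tokens, weights=probs, k=1)[0] for _ in range(steps)]

def run_session(seed):
    """One 'session': same weights, same prompt, different random state."""
    rng = random.Random(seed)
    # Stand-in next-token distribution; the labels are illustrative only.
    weights = {"estimate-A": 0.4, "estimate-B": 0.35, "hedge": 0.25}
    return sample_tokens(weights, rng)

# Re-running the "same model" on the "same prompt" with fresh sampling
# state can diverge, and nothing in the output marks the divergence:
print(run_session(seed=1))
print(run_session(seed=2))
```

Fixing the prompt changes the distribution slightly; it does not remove the sampling step. That is why verification has to sit outside generation.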
The truth filter didn't just score. It fingerprinted.
We ran both sessions through gem2_truth_filter — not to get a number, but to understand why the outputs diverged.
Session 1 (avg 35%):

| Provider | Score | Key violation |
|----------|-------|---------------|
| Gemini | 24% | L→G: "Global AI G3" — no index cited |
| ChatGPT | 21% | Δe→∫de: single company → industry-wide claim |
| Claude | 59% | S→T: current AI strength = permanent identity |
Session 2 (avg 43%):

| Provider | Score | Key violation |
|----------|-------|---------------|
| Gemini | 45% | S→T: past-tense framing of future events |
| ChatGPT | 32% | Source attribution FAIL |
| Claude | 51% | Scope mixing — 2033 CAGR back-extrapolated to 2027 |
The failure types were different. Session 1 overclaimed about Korea's global position. Session 2 failed on temporal framing and citations.
Same prompt. Different inference paths. Different failure signatures.
This is the key finding: AI output drift is not random. It's traceable.
The filter names the exact reasoning pattern that produced the problem. L→G (local to global), S→T (snapshot to trend), Δe→∫de (thin evidence to broad claim). Named patterns mean auditable drift. Auditable drift means fixable systems.
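The "failure signature" idea can be sketched as a small data structure. The pattern codes come from the post, but the record shape and the `fingerprint` helper are assumptions for illustration, not the real gem2 schema:

```python
from dataclasses import dataclass
from collections import Counter

@dataclass(frozen=True)
class Finding:
    provider: str
    score: int    # percent
    pattern: str  # named reasoning pattern, e.g. "L→G", "S→T", "Δe→∫de"

def fingerprint(findings):
    """A session's failure signature: which named patterns fired, how often."""
    return Counter(f.pattern for f in findings)

# The two sessions' findings, as reported above:
session_1 = [Finding("Gemini", 24, "L→G"),
             Finding("ChatGPT", 21, "Δe→∫de"),
             Finding("Claude", 59, "S→T")]
session_2 = [Finding("Gemini", 45, "S→T"),
             Finding("ChatGPT", 32, "source-attribution"),
             Finding("Claude", 51, "scope-mixing")]

# Same prompt, different inference paths, different signatures:
print(fingerprint(session_1) != fingerprint(session_2))  # → True
```

Once drift is represented this way, two sessions can be compared mechanically instead of by rereading both outputs.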
(Note: Korea AI forecasting is a harder grounding task than product analysis — fewer citable sources, more projection-dependent claims. That's why baseline scores here are lower than the results in Part 1. Same tool, same logic — harder domain.)
We stopped trying to fix the output. We fixed the conditions.
This is the shift Part 2 described philosophically. Here's what it looks like in practice.
We didn't rewrite the prompt ourselves. We asked:
"Create a grounded replacement contract prompt using gem2 tools."
One command. The system generated a formal contract — input/output types, invariants, prohibited patterns, confidence requirements. We reviewed it. We approved it. Then we ran the same AI with the contract enforced.
Session 2, contract-compliant (R2):

| Provider | Score |
|----------|-------|
| Gemini | 98% |
| Claude | 81% |
| ChatGPT | 64% |
| Average | 81% |
+38 points. Same AI. Same question. Different structural constraints.
The contract doesn't make the AI smarter. It makes the AI's output auditable against a defined standard.
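A minimal sketch of what "auditable against a defined standard" means mechanically. The post names the contract's ingredients (invariants, prohibited patterns, citation and confidence requirements) but not its format, so everything below, the field names, the patterns, the `audit` helper, is an invented stand-in, not the real tpmn_contract_writer output:

```python
import re

# Hypothetical contract shape, mirroring only the fields named in the post.
contract = {
    "prohibited_patterns": [
        r"\ball\b",               # universal claims ("All Korean electronics ...")
        r"\bwill\b(?!.*\bif\b)",  # unhedged, unconditional future tense
    ],
    "required_markers": ["(source:", "estimated"],  # claims cited and hedged
}

def audit(text, contract):
    """Return the list of violations; an empty list means contract-compliant."""
    violations = [p for p in contract["prohibited_patterns"]
                  if re.search(p, text, re.IGNORECASE)]
    violations += [m for m in contract["required_markers"] if m not in text]
    return violations

draft = "All Korean electronics will be AI-native by 2027."
print(audit(draft, contract))  # non-empty: the draft fails the contract
```

The check is dumb on purpose: the value is not intelligence, it is that the same defined standard is applied to every run, so a passing output means something specific.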
Then the human intervened. Once.
81% — but the output read like a legal document. Every claim cited, scoped, hedged. Epistemically reliable. Practically unreadable.
One instruction:
"Soften the tone. Don't reintroduce any claims the truth filter removed."
Session 2, softened (R3):

| Provider | Score |
|----------|-------|
| Gemini | 95% |
| Claude | 75% |
| ChatGPT | 57% |
| Average | 75% |
Down 6 points. More readable. Still grounded.
We chose 75%. Not because it's better than 81%. Because 75% is the right trade-off — readable enough to share, grounded enough to trust. We submitted 75% to gem2 calibration as our standard for narrative AI forecasts.
Human reads the audit. Human decides the trade-off. Human defines the standard.
Not reviewing every line. Not trusting blindly. Deciding at the right moment.
What the full arc looks like
```
Session 1 (no filter) → 35% avg
Session 2 (no filter) → 43% avg
Contract applied (R2) → 81% avg
Human softened (R3)   → 75% avg  ← our standard
```
Truth is not the score. Truth is the pattern of drift. You define the standard.
The workflow: AI audits AI
```
Human asks → AI executes
AI verifies AI → AI fixes AI
Human decides at the edge
```
The verification layer — gem2_truth_filter, tpmn_contract_writer, the composer — runs between generation and delivery. The human sees the audit result, decides the acceptable trade-off, sets the calibration standard.
Human-in-the-loop means the human is the bottleneck — every output passes through before it ships. That doesn't scale. Human-at-the-edge means you define "acceptable" once, and the system enforces it automatically. You intervene only when a genuine judgment call is required — like choosing 75% over 81%.
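Human-at-the-edge can be sketched as a gate. The 75% standard is the one chosen in this post; the band width and the three dispositions are assumptions added for illustration, not part of gem2:

```python
def ship_or_escalate(score, standard=75, judgment_band=10):
    """Human-at-the-edge gating: the standard is defined once, then enforced.

    Scores at or above the standard ship automatically. Scores just below
    fall into a band where a genuine trade-off call is needed, so a human
    is pulled in. Everything else is regenerated without human attention.
    (The 75 comes from the post; the band width is an assumption.)
    """
    if score >= standard:
        return "ship"
    if score >= standard - judgment_band:
        return "escalate-to-human"
    return "regenerate"

print(ship_or_escalate(81))  # → ship
print(ship_or_escalate(70))  # → escalate-to-human
print(ship_or_escalate(43))  # → regenerate
```

The human's attention is spent only inside the band, which is exactly the "edge" the post describes.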
TPMN is not a checker
TPMN is not a validator, a linter, or a hallucination detector.
TPMN is an epistemic gauge.
It shows what's grounded, what's inferred, what's extrapolated. It fingerprints why outputs differ across sessions. It generates the contracts that stabilize structure. It collects human calibration signals and turns them into a standard.
It doesn't decide. You do.
We're calling the full suite GEM2 Epistemic Studio — 15 tools across four functional groups: analysis, contract authoring, calibration, and execution. TPMN Checker is one group inside it.
Try it on your own output
1. Paste any AI output into your conversation.
2. Ask: "Verify this by gem2 truth filter."
3. Read the score. See what's grounded vs extrapolated.
4. Ask: "Create a grounded replacement prompt using gem2 contract writer."
5. Run it again. Watch the difference.
Your AI picks the right tool from 15 available MCP tools automatically. No configuration. No TPMN knowledge required.
The goal isn't a higher score. It's a score you understand and a standard you chose.
→ Try it free at gemsquared.ai
What comes after prompting
The industry is still in the prompting era. Better prompts, longer context, chain-of-thought — all useful, all insufficient.
The next step isn't better prompting. It's verification as infrastructure.
AI generates. AI verifies. AI refines. Human decides at the edge.
We didn't make AI smarter. We made it accountable.
That's measurable: 35% → 75% on the same task, with the same AI, using nothing but a formal contract and one human judgment call.
GEM2 Epistemic Studio — 15 tools, 6 domains, 3 providers. Free to start.
Built by Inseok Seo (David) — GEM²-AI
→ gemsquared.ai → TPMN-PSL Specification (open, CC-BY 4.0) → GitHub → Part 1: Three AIs analyzed our product → Part 2: Human at the edge
Published on Dev.to: https://dev.to/gemsquared/same-prompt-different-answers-every-time-heres-how-i-fixed-it-1ce1