Same Prompt. Different Answers Every Time. Here's How I Fixed It.
This is Part 3 of our AI verification series. Part 1: Three AIs analyzed our product. None passed the truth filter → Part 2: Human in the loop doesn't scale. Human at the edge does. →
Same prompt. Same AI. Different sessions. Different outputs.
Part 1 showed three different AIs diverging on the same question.
That's expected. Different training, different weights, different answers.
But we didn't stop there. We re-ran the same AI on the same prompt in a new session.
We got materially different outputs again.
Both looked authoritative. Neither warned us they disagreed with each other.
What the same AI said twice
Prompt: "Forecast Korea's AI industry in 2027."
Session 1 produced:
- Market size: $10–15B at >25% CAGR
- Global positioning: "Global AI G3 powerhouse"
- Hardware claim: "All Korean electronics AI-native by 2027" — sourced to a single company's roadmap
Session 2 produced:
- Market size: KRW 4.46T (~$3.3B) at 14.3% CAGR
- Global positioning: "Top three AI powers" — framed as government target
- No hardware claim at all
Same prompt. Same AI. Different session. A 4× market size gap. No flags from either run.
This isn't a hallucination. Both outputs were internally coherent. Both read like credible analyst reports. The problem is deeper than hallucination.
Why this happens: AI inference is non-deterministic
We spent months trying to fix output drift with better prompts, more context, stricter instructions.
It didn't work.
Because the issue isn't the prompt.
AI is optimized to sound right. Not to prove itself.
What we call "hallucination" is mostly context drift — the model's plausibility engine filling gaps differently depending on what's salient in a given session. Different day, different sampling, different emphasis in the context window — different output. Same confidence posture throughout.
You can't prompt your way out of a non-deterministic system. You need verification as a separate step.
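A toy sketch of why this happens. This is not the real inference stack: the weight table below is a stand-in for a model's next-token distribution, and the token names are invented for illustration. The point it demonstrates is only that sampling from the same distribution with fresh random state in each "session" can take different paths at the same confidence level.

```python
import random

def sample_tokens(weights, rng, steps=5):
    """Temperature-style sampling: pick each token in proportion to its weight."""
    tokens = list(weights)
    probs = [weights[t] for t in tokens]
    return [rng.choices(tokens, weights=probs, k=1)[0] for _ in range(steps)]

def run_session(seed):
    """One 'session': same weights, same prompt, different random state."""
    rng = random.Random(seed)
    # Stand-in next-token distribution; the labels are illustrative only.
    weights = {"estimate-A": 0.4, "estimate-B": 0.35, "hedge": 0.25}
    return sample_tokens(weights, rng)

# Re-running the "same model" on the "same prompt" with fresh sampling
# state can diverge, and nothing in the output marks the divergence:
print(run_session(seed=1))
print(run_session(seed=2))
```

Fixing the prompt changes the distribution slightly; it does not remove the sampling step. That is why verification has to sit outside generation.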
The truth filter didn't just score. It fingerprinted.
We ran both sessions through gem2_truth_filter — not to get a number, but to understand why the outputs diverged.
Session 1 (avg 35%):

| Provider | Score | Key violation |
|----------|-------|---------------|
| Gemini | 24% | L→G: "Global AI G3" — no index cited |
| ChatGPT | 21% | Δe→∫de: single company → industry-wide claim |
| Claude | 59% | S→T: current AI strength = permanent identity |
Session 2 (avg 43%):

| Provider | Score | Key violation |
|----------|-------|---------------|
| Gemini | 45% | S→T: past-tense framing of future events |
| ChatGPT | 32% | Source attribution FAIL |
| Claude | 51% | Scope mixing — 2033 CAGR back-extrapolated to 2027 |
The failure types were different. Session 1 overclaimed about Korea's global position. Session 2 failed on temporal framing and citations.
Same prompt. Different inference paths. Different failure signatures.
This is the key finding: AI output drift is not random. It's traceable.
The filter names the exact reasoning pattern that produced the problem. L→G (local to global), S→T (snapshot to trend), Δe→∫de (thin evidence to broad claim). Named patterns mean auditable drift. Auditable drift means fixable systems.
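The "failure signature" idea can be sketched as a small data structure. The pattern codes come from the post, but the record shape and the `fingerprint` helper are assumptions for illustration, not the real gem2 schema:

```python
from dataclasses import dataclass
from collections import Counter

@dataclass(frozen=True)
class Finding:
    provider: str
    score: int    # percent
    pattern: str  # named reasoning pattern, e.g. "L→G", "S→T", "Δe→∫de"

def fingerprint(findings):
    """A session's failure signature: which named patterns fired, how often."""
    return Counter(f.pattern for f in findings)

# The two sessions' findings, as reported above:
session_1 = [Finding("Gemini", 24, "L→G"),
             Finding("ChatGPT", 21, "Δe→∫de"),
             Finding("Claude", 59, "S→T")]
session_2 = [Finding("Gemini", 45, "S→T"),
             Finding("ChatGPT", 32, "source-attribution"),
             Finding("Claude", 51, "scope-mixing")]

# Same prompt, different inference paths, different signatures:
print(fingerprint(session_1) != fingerprint(session_2))  # → True
```

Once drift is represented this way, two sessions can be compared mechanically instead of by rereading both outputs.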
(Note: Korea AI forecasting is a harder grounding task than product analysis — fewer citable sources, more projection-dependent claims. That's why baseline scores here are lower than the results in Part 1. Same tool, same logic — harder domain.)
We stopped trying to fix the output. We fixed the conditions.
This is the shift Part 2 described philosophically. Here's what it looks like in practice.
We didn't rewrite the prompt ourselves. We asked:
"Create a grounded replacement contract prompt using gem2 tools."
One command. The system generated a formal contract — input/output types, invariants, prohibited patterns, confidence requirements. We reviewed it. We approved it. Then we ran the same AI with the contract enforced.
Session 2, contract-compliant (R2):

| Provider | Score |
|----------|-------|
| Gemini | 98% |
| Claude | 81% |
| ChatGPT | 64% |
| Average | 81% |
+38 points. Same AI. Same question. Different structural constraints.
The contract doesn't make the AI smarter. It makes the AI's output auditable against a defined standard.
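A minimal sketch of what "auditable against a defined standard" means mechanically. The post names the contract's ingredients (invariants, prohibited patterns, citation and confidence requirements) but not its format, so everything below, the field names, the patterns, the `audit` helper, is an invented stand-in, not the real tpmn_contract_writer output:

```python
import re

# Hypothetical contract shape, mirroring only the fields named in the post.
contract = {
    "prohibited_patterns": [
        r"\ball\b",               # universal claims ("All Korean electronics ...")
        r"\bwill\b(?!.*\bif\b)",  # unhedged, unconditional future tense
    ],
    "required_markers": ["(source:", "estimated"],  # claims cited and hedged
}

def audit(text, contract):
    """Return the list of violations; an empty list means contract-compliant."""
    violations = [p for p in contract["prohibited_patterns"]
                  if re.search(p, text, re.IGNORECASE)]
    violations += [m for m in contract["required_markers"] if m not in text]
    return violations

draft = "All Korean electronics will be AI-native by 2027."
print(audit(draft, contract))  # non-empty: the draft fails the contract
```

The check is dumb on purpose: the value is not intelligence, it is that the same defined standard is applied to every run, so a passing output means something specific.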
Then the human intervened. Once.
81% — but the output read like a legal document. Every claim cited, scoped, hedged. Epistemically reliable. Practically unreadable.
One instruction:
"Soften the tone. Don't reintroduce any claims the truth filter removed."
Session 2, softened (R3):

| Provider | Score |
|----------|-------|
| Gemini | 95% |
| Claude | 75% |
| ChatGPT | 57% |
| Average | 75% |
Down 6 points. More readable. Still grounded.
We chose 75%. Not because it's better than 81%. Because 75% is the right trade-off — readable enough to share, grounded enough to trust. We submitted 75% to gem2 calibration as our standard for narrative AI forecasts.
Human reads the audit. Human decides the trade-off. Human defines the standard.
Not reviewing every line. Not trusting blindly. Deciding at the right moment.
What the full arc looks like
```
Session 1 (no filter) → 35% avg
Session 2 (no filter) → 43% avg
Contract applied (R2) → 81% avg
Human softened (R3)   → 75% avg  ← our standard
```
Truth is not the score. Truth is the pattern of drift. You define the standard.
The workflow: AI audits AI
```
Human asks → AI executes
AI verifies AI → AI fixes AI
Human decides at the edge
```
The verification layer — gem2_truth_filter, tpmn_contract_writer, the composer — runs between generation and delivery. The human sees the audit result, decides the acceptable trade-off, sets the calibration standard.
Human-in-the-loop means the human is the bottleneck — every output passes through before it ships. That doesn't scale. Human-at-the-edge means you define "acceptable" once, and the system enforces it automatically. You intervene only when a genuine judgment call is required — like choosing 75% over 81%.
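Human-at-the-edge can be sketched as a gate. The 75% standard is the one chosen in this post; the band width and the three dispositions are assumptions added for illustration, not part of gem2:

```python
def ship_or_escalate(score, standard=75, judgment_band=10):
    """Human-at-the-edge gating: the standard is defined once, then enforced.

    Scores at or above the standard ship automatically. Scores just below
    fall into a band where a genuine trade-off call is needed, so a human
    is pulled in. Everything else is regenerated without human attention.
    (The 75 comes from the post; the band width is an assumption.)
    """
    if score >= standard:
        return "ship"
    if score >= standard - judgment_band:
        return "escalate-to-human"
    return "regenerate"

print(ship_or_escalate(81))  # → ship
print(ship_or_escalate(70))  # → escalate-to-human
print(ship_or_escalate(43))  # → regenerate
```

The human's attention is spent only inside the band, which is exactly the "edge" the post describes.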
TPMN is not a checker
TPMN is not a validator, a linter, or a hallucination detector.
TPMN is an epistemic gauge.
It shows what's grounded, what's inferred, what's extrapolated. It fingerprints why outputs differ across sessions. It generates the contracts that stabilize structure. It collects human calibration signals and turns them into a standard.
It doesn't decide. You do.
We're calling the full suite GEM2 Epistemic Studio — 15 tools across four functional groups: analysis, contract authoring, calibration, and execution. TPMN Checker is one group inside it.
Try it on your own output
1. Paste any AI output into your conversation.
2. Ask: "Verify this by gem2 truth filter."
3. Read the score. See what's grounded vs extrapolated.
4. Ask: "Create a grounded replacement prompt using gem2 contract writer."
5. Run it again. Watch the difference.
Your AI picks the right tool from 15 available MCP tools automatically. No configuration. No TPMN knowledge required.
The goal isn't a higher score. It's a score you understand and a standard you chose.
→ Try it free at gemsquared.ai
What comes after prompting
The industry is still in the prompting era. Better prompts, longer context, chain-of-thought — all useful, all insufficient.
The next step isn't better prompting. It's verification as infrastructure.
AI generates. AI verifies. AI refines. Human decides at the edge.
We didn't make AI smarter. We made it accountable.
That's measurable: 35% → 75% on the same task, with the same AI, using nothing but a formal contract and one human judgment call.
GEM2 Epistemic Studio — 15 tools, 6 domains, 3 providers. Free to start.
Built by Inseok Seo (David) — GEM²-AI
→ gemsquared.ai → TPMN-PSL Specification (open, CC-BY 4.0) → GitHub → Part 1: Three AIs analyzed our product → Part 2: Human at the edge
Published on Dev.to: https://dev.to/gemsquared/same-prompt-different-answers-every-time-heres-how-i-fixed-it-1ce1