Your LLM Passes Type Checks but Fails the "Vibe Check": How I Fixed AI Reliability
You validate your LLM outputs with Pydantic. The JSON is well-formed. The fields are correct. Life is good.
Then your model returns a "polite decline" that says "I'd rather gouge my eyes out."
It passes your type checks. It fails the vibe check.
This is the Semantic Gap — the space between structural correctness and actual meaning. Every team shipping LLM-powered features hits it eventually. I got tired of hitting it, so I built Semantix.
The Semantic Gap: Shape vs. Meaning
Here's what most validation looks like today:
```python
class Response(BaseModel):
    message: str
    tone: Literal["polite", "neutral", "firm"]
```
This tells you the shape is right. It tells you nothing about whether the meaning is right. Your model can return {"message": "Go away.", "tone": "polite"} and Pydantic will happily accept it.
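To see the gap concretely, here is that same `Response` model happily accepting a semantically wrong payload (a minimal sketch assuming Pydantic v2 and its `model_validate` method):

```python
from typing import Literal
from pydantic import BaseModel

class Response(BaseModel):
    message: str
    tone: Literal["polite", "neutral", "firm"]

# Structurally valid, semantically wrong: "Go away." is anything but polite,
# yet every field has the right type, so validation passes without complaint.
resp = Response.model_validate({"message": "Go away.", "tone": "polite"})
print(resp.tone)
```

No amount of stricter typing on `tone` fixes this; the mismatch lives between the two fields, not inside either one.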
Semantix flips the script. Instead of validating structure, you validate intent:
```python
from semantix import Intent, validate_intent

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without being rude or aggressive."""

@validate_intent
def decline_invite(event: str) -> ProfessionalDecline:
    return call_my_llm(event)
```
The docstring is the contract. A judge (LLM-based, NLI, or embedding) reads the output, reads the requirement, and decides: does this text actually do what it claims?
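To make the contract idea concrete, here is a toy sketch of the pattern (not the Semantix internals): the return annotation's docstring becomes the requirement, and a stand-in judge checks the output against it. The keyword check is a deliberate placeholder for a real LLM, NLI, or embedding judge.

```python
import functools

class Intent(str):
    """Base class: subclasses state their requirement in the docstring."""

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without being rude or aggressive."""

def toy_judge(output: str, requirement: str) -> bool:
    # Stand-in for a real judge: a crude keyword heuristic.
    return "decline" in output.lower() and "go away" not in output.lower()

def validate_intent(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Resolve the Intent subclass from the return annotation.
        intent_cls = fn.__annotations__["return"]
        output = fn(*args, **kwargs)
        if not toy_judge(output, intent_cls.__doc__):
            raise ValueError(f"Output failed intent: {intent_cls.__name__}")
        return intent_cls(output)
    return wrapper

@validate_intent
def decline_invite(event: str) -> ProfessionalDecline:
    return f"Thank you, but I must respectfully decline the {event}."
```

The key move is that the type annotation carries the semantic contract, so the call site reads like ordinary typed Python.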
What's New in v0.1.3: The Self-Healing Update
Informed Self-Healing
The biggest feature in v0.1.3 is informed retries. When an LLM output fails validation, the decorator doesn't just retry blindly — it tells the LLM exactly what went wrong.
Declare a semantix_feedback parameter in your function, and the decorator injects a structured Markdown report on each retry:
```python
from typing import Optional
from semantix import validate_intent
from semantix.judges.nli import NLIJudge

@validate_intent(judge=NLIJudge(), retries=2)
def decline(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    prompt = f"Decline this invite: {event}"
    if semantix_feedback:
        prompt += f"\n\n{semantix_feedback}"
    return call_llm(prompt)
```
On the first call, semantix_feedback is None. If validation fails, the next call receives something like:
```markdown
## Semantix Self-Healing Feedback

Attempt 1 failed validation.

### What went wrong
- Intent: ProfessionalDecline
- Score: 0.3210 (threshold not met)
- Judge reason: too vague

### What is required
The text must politely decline an invitation without being rude or aggressive.

### Your previous output (rejected)
Go away.

Please generate a new response that satisfies the requirement above.
```
The LLM gets the score, the reason, the requirement, and its own rejected output. It can learn from the failure in real time.
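For illustration, a report like the one shown above could be assembled from a judge result roughly like this (a hypothetical helper; the field names are illustrative, not the Semantix API):

```python
# Hypothetical sketch of assembling a self-healing feedback report from a
# judge result. Semantix builds this for you; this just shows the shape.
def build_feedback(attempt: int, intent_name: str, score: float,
                   reason: str, requirement: str, rejected: str) -> str:
    return "\n".join([
        "## Semantix Self-Healing Feedback",
        f"Attempt {attempt} failed validation.",
        "### What went wrong",
        f"- Intent: {intent_name}",
        f"- Score: {score:.4f} (threshold not met)",
        f"- Judge reason: {reason}",
        "### What is required",
        requirement,
        "### Your previous output (rejected)",
        rejected,
        "Please generate a new response that satisfies the requirement above.",
    ])

report = build_feedback(
    1, "ProfessionalDecline", 0.321, "too vague",
    "The text must politely decline an invitation without being rude or aggressive.",
    "Go away.",
)
```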
NLI as the Default Judge
We moved from LLMJudge to NLIJudge as the default. Why?
- No API key required — runs fully locally using a cross-encoder model
- Entailment > cosine similarity — NLI asks "does A entail B?", which is fundamentally the right question for intent validation; cosine similarity asks "are A and B about the same thing?", a weaker signal
- Fast enough — the default nli-MiniLM2-L6-H768 model is ~85MB and runs in milliseconds
You can still use any judge you want — LLMJudge, EmbeddingJudge, or your own custom Judge subclass.
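As a sketch of what a custom judge might look like (the exact Semantix `Judge` interface may differ; an `evaluate` method returning a score and a reason mirrors the behavior described above):

```python
# Toy custom judge: passes if the output avoids a blocklist of rude phrases.
# The (score, reason) return shape is an assumption modeled on the docs above.
from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: float   # 0.0-1.0 confidence
    reason: str

class KeywordJudge:
    def __init__(self, blocklist, threshold: float = 0.5):
        self.blocklist = [b.lower() for b in blocklist]
        self.threshold = threshold

    def evaluate(self, output: str, requirement: str) -> JudgeResult:
        hits = [b for b in self.blocklist if b in output.lower()]
        if hits:
            return JudgeResult(0.0, f"contains rude phrase(s): {hits}")
        return JudgeResult(1.0, "no rude phrases detected")

judge = KeywordJudge(["go away", "gouge my eyes out"])
```

A real judge would of course do semantic work rather than string matching, but the contract is the same: text in, score and reason out.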
Granular Scoring
LLMJudge no longer returns a binary Yes/No. It now returns a 0.0-1.0 confidence score and a text reason, giving the self-healing system richer feedback to work with.
The Proof: Benchmark Results
Talk is cheap. Here are the real numbers from tools/benchmark.py, comparing single-shot validation (no retries) against Semantix self-healing (2 retries with feedback injection):
| Scenario | No Healing | Self-Healing | Improvement |
| --- | --- | --- | --- |
| Professional Tone | 13.3% | 56.7% | +43.3% |
| Technical Explanation | 36.7% | 96.7% | +60.0% |
| Actionable Summary | 13.3% | 56.7% | +43.3% |
| Overall | 21.1% | 70.0% | +48.9% |
Self-healing more than triples the overall success rate. For technical explanations specifically, it pushes reliability from 36.7% to 96.7%.
These numbers are from a simulated LLM with a 40% baseline quality rate. Real LLMs start higher, so the absolute numbers will be better — but the relative improvement from self-healing holds.
How It Works Under the Hood
```
Your Function
      |
      v
@validate_intent
      |
      v
Call function -> Get raw string
      |
      v
Judge.evaluate(output, intent_description, threshold)
      |
      +-- PASS --> return Intent(output)
      |
      +-- FAIL --> SemanticIntentError
                        |
                        v
                   retries left?
                        |
                        +-- YES --> inject semantix_feedback -> retry
                        |
                        +-- NO  --> raise error
```
The decorator resolves the Intent subclass from your return type annotation, calls the judge, and manages the retry loop. The semantix_feedback injection is zero-boilerplate — just add the parameter and it works.
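The loop above can be sketched in a few lines (hypothetical names, not the actual Semantix source):

```python
# Minimal sketch of the validate -> feedback -> retry loop described above.
# `judge` is any callable returning {"score": float, "reason": str}.
def run_with_healing(fn, judge, intent_cls, retries=2, threshold=0.5, **kwargs):
    feedback = None
    for attempt in range(1, retries + 2):  # first call + `retries` retries
        output = fn(semantix_feedback=feedback, **kwargs)
        result = judge(output, intent_cls.__doc__)
        if result["score"] >= threshold:
            return intent_cls(output)
        # Build the feedback injected into the next attempt.
        feedback = (f"Attempt {attempt} failed (score {result['score']:.2f}): "
                    f"{result['reason']}. Requirement: {intent_cls.__doc__}")
    raise RuntimeError("SemanticIntentError: all attempts failed")

class PoliteDecline(str):
    """The text must politely decline an invitation."""

def fake_judge(output, requirement):
    ok = "decline" in output.lower()
    return {"score": 1.0 if ok else 0.0, "reason": "ok" if ok else "does not decline"}

def fake_llm(semantix_feedback=None, event="gala"):
    if semantix_feedback is None:
        return "Go away."  # first attempt: rude
    return f"I must politely decline the {event}."  # "healed" attempt

result = run_with_healing(fake_llm, fake_judge, PoliteDecline, event="gala")
```

The fake LLM fails on the first pass and succeeds once it sees the feedback, which is exactly the dynamic the benchmark numbers measure.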
Get Started in 30 Seconds
```shell
pip install "semantix-ai[nli]"
```
```python
from semantix import Intent, validate_intent

class PositiveSentiment(Intent):
    """The text must express a clearly positive, optimistic, or encouraging sentiment."""

@validate_intent(retries=2)
def encourage(name: str, semantix_feedback=None) -> PositiveSentiment:
    prompt = f"Write an encouraging message for {name}"
    if semantix_feedback:
        prompt += f"\n\n{semantix_feedback}"
    return call_your_llm(prompt)
```
That's it. Your LLM output is now semantically typed and self-healing.
Links

- GitHub: github.com/labrat-akhona/semantix-ai
- PyPI: pypi.org/project/semantix-ai
- Install: `pip install semantix-ai`
Star the repo if this is useful. Open an issue if it isn't — I want to know what's missing.
Built by Akhona Eland in South Africa.