Your LLM Passes Type Checks but Fails the "Vibe Check": How I Fixed AI Reliability
You validate your LLM outputs with Pydantic. The JSON is well-formed. The fields are correct. Life is good.
Then your model returns a "polite decline" that says "I'd rather gouge my eyes out."
It passes your type checks. It fails the vibe check.
This is the Semantic Gap — the space between structural correctness and actual meaning. Every team shipping LLM-powered features hits it eventually. I got tired of hitting it, so I built Semantix.
The Semantic Gap: Shape vs. Meaning
Here's what most validation looks like today:
```python
class Response(BaseModel):
    message: str
    tone: Literal["polite", "neutral", "firm"]
```
This tells you the shape is right. It tells you nothing about whether the meaning is right. Your model can return {"message": "Go away.", "tone": "polite"} and Pydantic will happily accept it.
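To see the gap concretely, here is that same `Response` model happily accepting a semantically wrong payload (a minimal sketch assuming Pydantic v2 and its `model_validate` method):

```python
from typing import Literal
from pydantic import BaseModel

class Response(BaseModel):
    message: str
    tone: Literal["polite", "neutral", "firm"]

# Structurally valid, semantically wrong: "Go away." is anything but polite,
# yet every field has the right type, so validation passes without complaint.
resp = Response.model_validate({"message": "Go away.", "tone": "polite"})
print(resp.tone)
```

No amount of stricter typing on `tone` fixes this; the mismatch lives between the two fields, not inside either one.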
Semantix flips the script. Instead of validating structure, you validate intent:
```python
from semantix import Intent, validate_intent

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without being rude or aggressive."""

@validate_intent
def decline_invite(event: str) -> ProfessionalDecline:
    return call_my_llm(event)
```
The docstring is the contract. A judge (LLM-based, NLI, or embedding) reads the output, reads the requirement, and decides: does this text actually do what it claims?
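To make the contract idea concrete, here is a toy sketch of the pattern (not the Semantix internals): the return annotation's docstring becomes the requirement, and a stand-in judge checks the output against it. The keyword check is a deliberate placeholder for a real LLM, NLI, or embedding judge.

```python
import functools

class Intent(str):
    """Base class: subclasses state their requirement in the docstring."""

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without being rude or aggressive."""

def toy_judge(output: str, requirement: str) -> bool:
    # Stand-in for a real judge: a crude keyword heuristic.
    return "decline" in output.lower() and "go away" not in output.lower()

def validate_intent(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Resolve the Intent subclass from the return annotation.
        intent_cls = fn.__annotations__["return"]
        output = fn(*args, **kwargs)
        if not toy_judge(output, intent_cls.__doc__):
            raise ValueError(f"Output failed intent: {intent_cls.__name__}")
        return intent_cls(output)
    return wrapper

@validate_intent
def decline_invite(event: str) -> ProfessionalDecline:
    return f"Thank you, but I must respectfully decline the {event}."
```

The key move is that the type annotation carries the semantic contract, so the call site reads like ordinary typed Python.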
What's New in v0.1.3: The Self-Healing Update
Informed Self-Healing
The biggest feature in v0.1.3 is informed retries. When an LLM output fails validation, the decorator doesn't just retry blindly — it tells the LLM exactly what went wrong.
Declare a semantix_feedback parameter in your function, and the decorator injects a structured Markdown report on each retry:
```python
from typing import Optional
from semantix import validate_intent
from semantix.judges.nli import NLIJudge

@validate_intent(judge=NLIJudge(), retries=2)
def decline(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    prompt = f"Decline this invite: {event}"
    if semantix_feedback:
        prompt += f"\n\n{semantix_feedback}"
    return call_llm(prompt)
```
On the first call, semantix_feedback is None. If validation fails, the next call receives something like:
```markdown
## Semantix Self-Healing Feedback

Attempt 1 failed validation.

### What went wrong
- Intent: ProfessionalDecline
- Score: 0.3210 (threshold not met)
- Judge reason: too vague

### What is required
The text must politely decline an invitation without being rude or aggressive.

### Your previous output (rejected)
Go away.

Please generate a new response that satisfies the requirement above.
```
The LLM gets the score, the reason, the requirement, and its own rejected output. It can learn from the failure in real time.
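For illustration, a report like the one shown above could be assembled from a judge result roughly like this (a hypothetical helper; the field names are illustrative, not the Semantix API):

```python
# Hypothetical sketch of assembling a self-healing feedback report from a
# judge result. Semantix builds this for you; this just shows the shape.
def build_feedback(attempt: int, intent_name: str, score: float,
                   reason: str, requirement: str, rejected: str) -> str:
    return "\n".join([
        "## Semantix Self-Healing Feedback",
        f"Attempt {attempt} failed validation.",
        "### What went wrong",
        f"- Intent: {intent_name}",
        f"- Score: {score:.4f} (threshold not met)",
        f"- Judge reason: {reason}",
        "### What is required",
        requirement,
        "### Your previous output (rejected)",
        rejected,
        "Please generate a new response that satisfies the requirement above.",
    ])

report = build_feedback(
    1, "ProfessionalDecline", 0.321, "too vague",
    "The text must politely decline an invitation without being rude or aggressive.",
    "Go away.",
)
```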
NLI as the Default Judge
We moved from LLMJudge to NLIJudge as the default. Why?
- No API key required — runs fully locally using a cross-encoder model
- Entailment > cosine similarity — NLI asks "does A entail B?", which is fundamentally the right question for intent validation; cosine similarity asks "are A and B about the same thing?", a weaker signal
- Fast enough — the default nli-MiniLM2-L6-H768 model is ~85MB and runs in milliseconds
You can still use any judge you want — LLMJudge, EmbeddingJudge, or your own custom Judge subclass.
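As a sketch of what a custom judge might look like (the exact Semantix `Judge` interface may differ; an `evaluate` method returning a score and a reason mirrors the behavior described above):

```python
# Toy custom judge: passes if the output avoids a blocklist of rude phrases.
# The (score, reason) return shape is an assumption modeled on the docs above.
from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: float   # 0.0-1.0 confidence
    reason: str

class KeywordJudge:
    def __init__(self, blocklist, threshold: float = 0.5):
        self.blocklist = [b.lower() for b in blocklist]
        self.threshold = threshold

    def evaluate(self, output: str, requirement: str) -> JudgeResult:
        hits = [b for b in self.blocklist if b in output.lower()]
        if hits:
            return JudgeResult(0.0, f"contains rude phrase(s): {hits}")
        return JudgeResult(1.0, "no rude phrases detected")

judge = KeywordJudge(["go away", "gouge my eyes out"])
```

A real judge would of course do semantic work rather than string matching, but the contract is the same: text in, score and reason out.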
Granular Scoring
LLMJudge no longer returns a binary Yes/No. It now returns a 0.0-1.0 confidence score and a text reason, giving the self-healing system richer feedback to work with.
The Proof: Benchmark Results
Talk is cheap. Here are the real numbers from tools/benchmark.py, comparing single-shot validation (no retries) against Semantix self-healing (2 retries with feedback injection):
| Scenario | No Healing | Self-Healing | Improvement |
| --- | --- | --- | --- |
| Professional Tone | 13.3% | 56.7% | +43.3% |
| Technical Explanation | 36.7% | 96.7% | +60.0% |
| Actionable Summary | 13.3% | 56.7% | +43.3% |
| Overall | 21.1% | 70.0% | +48.9% |
Self-healing more than triples the overall success rate. For technical explanations specifically, it pushes reliability from 36.7% to 96.7%.
These numbers are from a simulated LLM with a 40% baseline quality rate. Real LLMs start higher, so the absolute numbers will be better — but the relative improvement from self-healing holds.
How It Works Under the Hood
```
Your Function
      |
      v
@validate_intent
      |
      v
Call function -> Get raw string
      |
      v
Judge.evaluate(output, intent_description, threshold)
      |
      +-- PASS --> return Intent(output)
      |
      +-- FAIL --> SemanticIntentError
                        |
                        v
                   retries left?
                        |
                        +-- YES --> inject semantix_feedback -> retry
                        |
                        +-- NO  --> raise error
```
The decorator resolves the Intent subclass from your return type annotation, calls the judge, and manages the retry loop. The semantix_feedback injection is zero-boilerplate — just add the parameter and it works.
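The loop above can be sketched in a few lines (hypothetical names, not the actual Semantix source):

```python
# Minimal sketch of the validate -> feedback -> retry loop described above.
# `judge` is any callable returning {"score": float, "reason": str}.
def run_with_healing(fn, judge, intent_cls, retries=2, threshold=0.5, **kwargs):
    feedback = None
    for attempt in range(1, retries + 2):  # first call + `retries` retries
        output = fn(semantix_feedback=feedback, **kwargs)
        result = judge(output, intent_cls.__doc__)
        if result["score"] >= threshold:
            return intent_cls(output)
        # Build the feedback injected into the next attempt.
        feedback = (f"Attempt {attempt} failed (score {result['score']:.2f}): "
                    f"{result['reason']}. Requirement: {intent_cls.__doc__}")
    raise RuntimeError("SemanticIntentError: all attempts failed")

class PoliteDecline(str):
    """The text must politely decline an invitation."""

def fake_judge(output, requirement):
    ok = "decline" in output.lower()
    return {"score": 1.0 if ok else 0.0, "reason": "ok" if ok else "does not decline"}

def fake_llm(semantix_feedback=None, event="gala"):
    if semantix_feedback is None:
        return "Go away."  # first attempt: rude
    return f"I must politely decline the {event}."  # "healed" attempt

result = run_with_healing(fake_llm, fake_judge, PoliteDecline, event="gala")
```

The fake LLM fails on the first pass and succeeds once it sees the feedback, which is exactly the dynamic the benchmark numbers measure.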
Get Started in 30 Seconds
```shell
pip install "semantix-ai[nli]"
```
```python
from semantix import Intent, validate_intent

class PositiveSentiment(Intent):
    """The text must express a clearly positive, optimistic, or encouraging sentiment."""

@validate_intent(retries=2)
def encourage(name: str, semantix_feedback=None) -> PositiveSentiment:
    prompt = f"Write an encouraging message for {name}"
    if semantix_feedback:
        prompt += f"\n\n{semantix_feedback}"
    return call_your_llm(prompt)
```
That's it. Your LLM output is now semantically typed and self-healing.
Links

- GitHub: github.com/labrat-akhona/semantix-ai
- PyPI: pypi.org/project/semantix-ai
- Install: `pip install semantix-ai`
Star the repo if this is useful. Open an issue if it isn't — I want to know what's missing.
Built by Akhona Eland in South Africa.