Voice-to-Schema: Turning "Track My Invoices" Into a Real Table
We rebuilt our NLP pipeline three times before it actually worked. Here's what went wrong each time and what we learned about the gap between what people say and what they mean.
The Problem Nobody Talks About
When we started building VoiceTables, we had a simple hypothesis: let people describe what they need, and generate a structured table from that description. User says "I need to track my invoices," system creates an invoices table with sensible columns. Easy, right?
Turns out spoken language and structured data are almost completely different things. The first version took about two weeks to build and maybe 20 minutes to realize it was broken.
Attempt 1: Naive Prompt Engineering
The first pipeline was embarrassingly simple. Take the transcript, send it to an LLM with a system prompt like "extract a table schema from this description," parse the JSON response.
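In sketch form, v1 was roughly the following. The prompt text and the `call_llm` stub are illustrative, not our actual client code; a real version would hit an LLM API.

```python
import json

SYSTEM_PROMPT = (
    "Extract a table schema from this description. "
    'Respond with JSON: {"table": str, "columns": [{"name": str, "type": str}]}'
)

def call_llm(system: str, user: str) -> str:
    # Stub so the sketch runs; in production this was a single LLM API call.
    return json.dumps({
        "table": "invoices",
        "columns": [
            {"name": "Client Name", "type": "text"},
            {"name": "Invoice Amount", "type": "number"},
            {"name": "Due Date", "type": "date"},
            {"name": "Status", "type": "text"},
        ],
    })

def v1_schema(transcript: str) -> dict:
    raw = call_llm(SYSTEM_PROMPT, transcript)
    return json.loads(raw)  # no validation, no retries -- exactly the problem

schema = v1_schema(
    "Create a table with columns: client name, invoice amount, due date, status"
)
```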
It worked perfectly for clean inputs. "Create a table with columns: client name, invoice amount, due date, status" produced exactly what you'd expect.
Nobody talks like that.
Real inputs looked like this:
- "uh, I need something for... like tracking stuff, you know, for my clients"
- "make me a table, invoice things, the usual"
- "so I've got these freelance gigs and I keep losing track of who paid me"
The third one is actually the most useful input. It tells you what the user needs (payment tracking), who they're working with (freelance clients), and what the pain point is (losing track of payments). But our v1 pipeline couldn't extract any of that. It would either hallucinate random columns or return a generic two-column table that helped nobody.
Attempt 2: Two-Stage Extraction
For the second version, we split the pipeline into two steps:
- Intent extraction: figure out what domain the user is working in (invoicing, project management, inventory, etc.) and what they actually want to track
- Schema generation: given the intent, generate an appropriate table structure
This was better. The intent layer caught things like "freelance gigs" mapping to freelancer invoicing, which gave the schema generator much better context.
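A toy version of that split, with keyword matching standing in for what was really a second LLM call (the keyword sets and template columns here are illustrative, not our production data):

```python
# Stage 1: map the transcript to a domain.
INTENT_KEYWORDS = {
    "invoicing": {"invoice", "paid", "payment", "freelance", "client"},
    "inventory": {"stock", "warehouse", "sku", "quantity"},
    "project_management": {"project", "deadline", "task", "milestone"},
}

# Stage 2: map the domain to a starting schema.
SCHEMAS = {
    "invoicing": ["Client Name", "Project", "Amount", "Due Date", "Status"],
}

def extract_intent(transcript: str) -> str:
    words = set(transcript.lower().split())
    scores = {domain: len(words & kw) for domain, kw in INTENT_KEYWORDS.items()}
    return max(scores, key=scores.get)

def generate_schema(intent: str) -> list[str]:
    return SCHEMAS.get(intent, ["Name", "Notes"])  # generic fallback

intent = extract_intent(
    "so I've got these freelance gigs and I keep losing track of who paid me"
)
```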
But we hit a new problem: ambiguity in spoken language vs. typed input.
When someone types "client name," they mean a text column called "Client Name." When someone says "client name," they might mean:
- The name of their client (text column)
- A reference to an existing clients table (foreign key)
- "the client" as filler while they think about what to say next
Spoken language has pauses, restarts, filler words, self-corrections. "I need a table for... no wait... like, a list of my clients and their... the projects, and how much each one... you know, the budget for each project."
That sentence contains at least three potential columns (client, project, budget) and a relationship (clients have projects). Our v2 would sometimes generate six columns because it treated "no wait" and "you know" as potential data points.
Attempt 3: What Actually Works
The third rewrite introduced something we should have done from the start: a confidence-scored extraction with a clarification loop.
The pipeline now works like this:
1. Transcript cleanup: strip filler words, normalize speech patterns, handle self-corrections ("no wait" means "ignore what I just said")
2. Entity extraction with confidence: each potential column gets a confidence score. "Budget" from the sentence above gets 0.9. "Projects" gets 0.85. "The" gets filtered out entirely.
3. Schema proposal: generate the table structure, but only include columns above a confidence threshold
4. Gap detection: identify what's probably missing. If someone mentions invoices but no date column, that's a gap worth asking about.
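A minimal sketch of the cleanup, thresholding, and gap-detection steps. The filler list, the hand-written confidence scores, and the `EXPECTED` map are stand-ins for what the real pipeline learns or gets from the model:

```python
import re

FILLERS = re.compile(r"\b(uh|um|like|you know|so)\b[,\s]*", re.IGNORECASE)
RESTART = re.compile(r"(no wait|scratch that)[,.]?", re.IGNORECASE)

def clean_transcript(text: str) -> str:
    # A restart marker cancels everything said before it.
    text = RESTART.split(text)[-1]
    text = FILLERS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def propose_columns(scored: dict[str, float], threshold: float = 0.6) -> list[str]:
    # Keep only candidates the extractor is reasonably sure about.
    return [name for name, conf in scored.items() if conf >= threshold]

# Domain-specific columns whose absence is worth a clarification question.
EXPECTED = {"invoicing": {"Amount", "Due Date", "Status"}}

def detect_gaps(domain: str, columns: list[str]) -> set[str]:
    return EXPECTED.get(domain, set()) - set(columns)
```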
The key insight was that we don't need to get it perfect on the first pass. We just need to get it good enough that the user can see what we understood and correct us quickly. "I see you want to track invoices for freelance clients. I've set up columns for Client Name, Project, Amount, and Status. Want me to add anything else?"
That conversational correction loop is way more natural than trying to parse everything perfectly from a single voice input.
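The correction loop itself can be as simple as assembling one targeted follow-up from the proposal and its gaps. This helper is a hypothetical illustration, not our production message builder:

```python
def confirmation_message(intent: str, columns: list[str], gaps: set[str]) -> str:
    # Show what we understood, then ask one smart follow-up question.
    msg = f"I see you want to track {intent}. "
    msg += "I've set up columns for " + ", ".join(columns) + "."
    if gaps:
        msg += f" Should I also add {' and '.join(sorted(gaps))}?"
    else:
        msg += " Want me to add anything else?"
    return msg
```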
Failure Modes We Still Deal With
It's not all solved. Some recurring edge cases:
Language mixing. We have users who switch between languages mid-sentence. Our pipeline handles Czech and English separately, but code-switching mid-sentence still trips it up sometimes.
Implicit schemas. "Make it like a CRM" assumes shared knowledge about what a CRM table looks like. We built a library of common schema templates (CRM, invoice tracker, project board, inventory) that the intent layer can match against. It covers maybe 70% of cases.
Overspecification. Some users describe every single column in detail, including data types and validation rules, all in one breath. The pipeline gets confused because it's optimized for the messy, underspecified case. We're still tuning the balance.
Numbers
After the third rewrite:
- First-pass schema accuracy went from ~40% (v1) to ~78% (v3), measured by "did the user accept the generated schema without modifications"
- Clarification rounds needed before a usable table dropped from 3+ to usually 0-1
- The cleanup step alone (stripping fillers, handling corrections) improved extraction accuracy by about 15 percentage points
What I'd Do Differently
If I were starting from scratch, I'd skip the "parse everything from one input" approach entirely. Start with the clarification loop from day one. People are surprisingly patient with "let me make sure I understood you" if the follow-up question is smart.
Also, collect real voice inputs as early as possible. We spent two weeks optimizing for typed test inputs that looked nothing like actual speech. The gap between "create a table with columns name, email, phone" and "uh yeah so I need like a contacts thing" is massive, and you won't close it without real data.
I'm building VoiceTables as part of the Inithouse portfolio, where we ship AI-powered tools across different verticals. Some of our other projects include Be Recommended (check if AI chatbots recommend your brand), Watching Agents (AI prediction platform), and Audit Vibecoding (automated audits for AI-generated code). If you're building voice-first interfaces or have war stories about NLP pipelines, I'd love to hear about it in the comments.
Originally published on Dev.to: https://dev.to/jakub_inithouse/voice-to-schema-turning-track-my-invoices-into-a-real-table-1b4a