
Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study

arXiv cs.SE | Submitted on 2 Apr 2026 | April 3, 2026
Explain Like I'm 5

Hey there, little explorer! 🚀

Imagine you have a big pile of toys, and you need to sort them into different boxes, like "cars," "blocks," and "dolls."

This grown-up story is like that! Scientists tested different smart computer friends (we call them "AI helpers") to see which one is best at sorting things from a shopping list. Like, is "milk" a "food" or a "drink"?

They found one helper, named Claude 3.7 Sonnet, that gave the best mix of sorting things correctly and not costing too much computer money. So, Claude is like the best toy-sorter friend! Yay Claude! 🎉

arXiv:2604.01615v1

Abstract: This paper presents a systematic, cost-aware evaluation of large language models (LLMs) for receipt-item categorisation within a production-oriented classification framework. We compare four instruction-tuned models available through AWS Bedrock: Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, and Mistral 7B Instruct. The aim of the study was (1) to assess performance across accuracy, response stability, and token-level cost, and (2) to investigate what prompting methods, zero-shot or few-shot, are especially appropriate both in terms of accuracy and in terms of incurred costs. Results of our experiments demonstrated that Claude 3.7 Sonnet achieves the most favourable balance between classification accuracy and cost efficiency.
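To make the three evaluation criteria and the two prompting styles concrete, here is a minimal Python sketch. It is not the authors' framework: the abstract does not define "response stability" or give prices, so the definitions below (majority-vote accuracy, agreement across repeated calls, cost per 1,000 tokens), the names `build_prompt`, `Run`, `accuracy`, `stability` and `cost_usd`, the toy items, and the price values are all assumptions made for illustration. No model is actually called; the predictions are hand-written stand-ins.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


def build_prompt(item: str, categories: List[str],
                 examples: Optional[List[Tuple[str, str]]] = None) -> str:
    """Zero-shot when `examples` is None or empty, few-shot otherwise."""
    lines = ["Assign the receipt item to exactly one category.",
             "Categories: " + ", ".join(categories)]
    for ex_item, ex_label in examples or []:
        lines.append(f"Item: {ex_item}\nCategory: {ex_label}")
    lines.append(f"Item: {item}\nCategory:")
    return "\n".join(lines)


@dataclass
class Run:
    item: str
    gold: str                # reference category
    predictions: List[str]   # labels from repeated calls with the same prompt
    input_tokens: int        # summed over the repeated calls
    output_tokens: int       # summed over the repeated calls


def accuracy(runs: List[Run]) -> float:
    """Share of items whose most frequent prediction equals the gold label."""
    hits = sum(Counter(r.predictions).most_common(1)[0][0] == r.gold for r in runs)
    return hits / len(runs)


def stability(runs: List[Run]) -> float:
    """Mean share of repeated calls that agree with the item's most frequent label."""
    shares = [Counter(r.predictions).most_common(1)[0][1] / len(r.predictions)
              for r in runs]
    return sum(shares) / len(shares)


def cost_usd(runs: List[Run], price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Token-level cost; the prices are supplied by the caller."""
    tokens_in = sum(r.input_tokens for r in runs)
    tokens_out = sum(r.output_tokens for r in runs)
    return tokens_in / 1000 * price_in_per_1k + tokens_out / 1000 * price_out_per_1k


# Made-up data, not results from the paper.
toy_runs = [
    Run("MILK 2L", "Dairy", ["Dairy", "Dairy", "Dairy"], 300, 6),
    Run("AA BATTERIES", "Household", ["Household", "Electronics", "Household"], 330, 6),
]

if __name__ == "__main__":
    print(build_prompt("MILK 2L", ["Dairy", "Household"], [("EGGS 12PK", "Dairy")]))
    print("accuracy :", accuracy(toy_runs))
    print("stability:", stability(toy_runs))
    # Placeholder prices, not actual AWS Bedrock rates.
    print("cost USD :", cost_usd(toy_runs, price_in_per_1k=0.003, price_out_per_1k=0.015))
```

Under these assumptions, a few-shot prompt carries more input tokens per call than a zero-shot one, so comparing `cost_usd` alongside `accuracy` for both styles is one way to express the accuracy-versus-cost trade-off the abstract refers to.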


Comments: Preprint. Accepted to the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2026). Final version to be published by SCITEPRESS.

Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Cite as: arXiv:2604.01615 [cs.AI]

(or arXiv:2604.01615v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.01615

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Maria Spichkova
[v1] Thu, 2 Apr 2026 04:50:11 UTC (1,155 KB)
