Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study
Hey there, little explorer! 🚀
Imagine you have a big pile of toys, and you need to sort them into different boxes, like "cars," "blocks," and "dolls."
This grown-up story is like that! Scientists tested different smart computer friends (we call them "AI helpers") to see which one is best at sorting things from a shopping list. Like, is "milk" a "food" or a "drink"?
They found one helper, named Claude 3.7 Sonnet, who had the best mix of sorting well and not costing too much computer money. So, Claude 3.7 Sonnet is like the best toy-sorter friend! Yay Claude! 🎉
Abstract: This paper presents a systematic, cost-aware evaluation of large language models (LLMs) for receipt-item categorisation within a production-oriented classification framework. We compare four instruction-tuned models available through AWS Bedrock: Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, and Mistral 7B Instruct. The aim of the study was (1) to assess performance across accuracy, response stability, and token-level cost, and (2) to investigate what prompting methods, zero-shot or few-shot, are especially appropriate both in terms of accuracy and in terms of incurred costs. Results of our experiments demonstrated that Claude 3.7 Sonnet achieves the most favourable balance between classification accuracy and cost efficiency.
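The abstract names three measurements (accuracy, response stability, token-level cost) and two prompting styles (zero-shot, few-shot). The following is a minimal Python sketch of how such an evaluation could be wired together. It is not the authors' framework: the category list, the few-shot examples, the prompt wording, the "majority answer" and "agreement" definitions, and every function name are assumptions made for illustration, and the prices passed to `token_cost` are placeholders to be replaced with current Bedrock pricing. The model is passed in as a plain callable so the sketch stays independent of any particular client library.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical taxonomy and examples; the paper's actual categories are not given here.
CATEGORIES = ["food", "drink", "household", "other"]
FEW_SHOT = [("whole milk 1l", "drink"), ("sourdough loaf", "food")]


def build_prompt(item: str, few_shot: bool = False) -> str:
    """Return a zero-shot prompt, or a few-shot prompt with labelled examples added."""
    lines = [
        f"Classify the receipt item into one of: {', '.join(CATEGORIES)}.",
        "Answer with the category name only.",
    ]
    if few_shot:
        lines += [f"Item: {text}\nCategory: {label}" for text, label in FEW_SHOT]
    lines.append(f"Item: {item}\nCategory:")
    return "\n".join(lines)


def normalise(answer: str) -> str:
    """Map a raw model reply onto a known category, falling back to 'other'."""
    cleaned = answer.strip().lower()
    return cleaned if cleaned in CATEGORIES else "other"


def token_cost(input_tokens: int, output_tokens: int,
               usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Cost of one call from token counts and per-1,000-token prices (placeholders)."""
    return input_tokens / 1000 * usd_per_1k_in + output_tokens / 1000 * usd_per_1k_out


def evaluate(model: Callable[[str], str], items: Sequence[tuple[str, str]],
             few_shot: bool, runs: int = 3) -> dict[str, float]:
    """Accuracy of the majority answer, plus stability as mean agreement across runs."""
    correct = 0
    agreement = 0.0
    for text, gold in items:
        prompt = build_prompt(text, few_shot)
        # Repeat the same prompt to see how often the model gives the same label.
        answers = [normalise(model(prompt)) for _ in range(runs)]
        top, count = Counter(answers).most_common(1)[0]
        correct += top == gold
        agreement += count / runs
    n = len(items)
    return {"accuracy": correct / n, "stability": agreement / n}
```

In use, `model` would wrap a call to one of the four Bedrock models and return the reply text; running `evaluate` once with `few_shot=False` and once with `few_shot=True`, then comparing the results alongside the summed `token_cost`, gives the kind of accuracy-versus-cost comparison the abstract describes. Few-shot prompts are longer, so they raise input-token cost on every call, which is why the two have to be weighed together.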
Comments: Preprint. Accepted to the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2026). Final version to be published by SCITEPRESS, this http URL
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Cite as: arXiv:2604.01615 [cs.AI]
(or arXiv:2604.01615v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.01615
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Maria Spichkova
[v1] Thu, 2 Apr 2026 04:50:11 UTC (1,155 KB)