Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study

arXiv cs.SE · Submitted on 2 Apr 2026 · April 3, 2026

🧒 Explain Like I'm 5 (simple language)

Hey there, little explorer! 🚀

Imagine you have a big pile of toys, and you need to sort them into different boxes, like "cars," "blocks," and "dolls."

This grown-up story is like that! Scientists tested different smart computer friends (we call them "AI helpers") to see which one is best at sorting things from a shopping list. Like, is "milk" a "food" or a "drink"?

They found one helper, named Claude 3.7 Sonnet, who had the best mix of sorting well and not costing too much computer money. So, Claude is like the best toy-sorter friend! Yay Claude! 🎉

Abstract: This paper presents a systematic, cost-aware evaluation of large language models (LLMs) for receipt-item categorisation within a production-oriented classification framework. We compare four instruction-tuned models available through AWS Bedrock: Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, and Mistral 7B Instruct. The aim of the study was (1) to assess performance across accuracy, response stability, and token-level cost, and (2) to investigate what prompting methods, zero-shot or few-shot, are especially appropriate both in terms of accuracy and in terms of incurred costs. Results of our experiments demonstrated that Claude 3.7 Sonnet achieves the most favourable balance between classification accuracy and cost efficiency.
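As a rough illustration of the two quantities the abstract weighs against each other, the following Python sketch builds zero-shot and few-shot prompts for a receipt item and scores a batch of model calls on accuracy and token-level cost. It is not the authors' framework: the category list, the few-shot examples, the `Call` record, and the per-1,000-token prices passed to `total_cost` are all hypothetical placeholders, and no Bedrock call is made.

```python
from dataclasses import dataclass

# Hypothetical labels and examples; the paper's actual category set is not given here.
CATEGORIES = ["food", "drink", "household"]
FEW_SHOT_EXAMPLES = [("whole milk 1l", "drink"), ("rye bread", "food")]


def build_prompt(item: str, few_shot: bool = False) -> str:
    """Return a zero-shot prompt, or a few-shot one with labelled examples included."""
    parts = [f"Classify the receipt item into one of: {', '.join(CATEGORIES)}."]
    if few_shot:
        for example_item, label in FEW_SHOT_EXAMPLES:
            parts.append(f"Item: {example_item}\nCategory: {label}")
    parts.append(f"Item: {item}\nCategory:")
    return "\n\n".join(parts)


@dataclass
class Call:
    """One model call: the predicted and expected labels plus token usage (assumed fields)."""
    predicted: str
    expected: str
    input_tokens: int
    output_tokens: int


def accuracy(calls: list[Call]) -> float:
    """Fraction of calls whose predicted label matches the expected label."""
    return sum(c.predicted == c.expected for c in calls) / len(calls)


def total_cost(calls: list[Call], price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Token-level cost of a batch, given placeholder prices per 1,000 tokens."""
    return sum(
        c.input_tokens / 1000 * price_in_per_1k
        + c.output_tokens / 1000 * price_out_per_1k
        for c in calls
    )
```

Few-shot prompts are longer than zero-shot ones, so they raise the input-token count per call; comparing `accuracy` against `total_cost` for each prompting style is one simple way to frame the trade-off the study investigates.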


Comments: Preprint. Accepted to the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2026). Final version to be published by SCITEPRESS.

Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Cite as: arXiv:2604.01615 [cs.AI]

(or arXiv:2604.01615v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.01615

arXiv-issued DOI via DataCite (pending registration)