APITestGenie: Generating Web API Tests from Requirements and API Specifications with LLMs
arXiv:2604.02039v1 Announce Type: new
Abstract: Modern software systems rely heavily on Web APIs, yet creating meaningful and executable test scripts remains a largely manual, time-consuming, and error-prone task. In this paper, we present APITestGenie, a novel tool that leverages Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and prompt engineering to automatically generate API integration tests directly from business requirements and OpenAPI specifications. We evaluated APITestGenie on 10 real-world APIs, including 8 APIs comprising circa 1,000 live endpoints from an industrial partner in the automotive domain. The tool was able to generate syntactically and semantically valid test scripts for 89% of the business requirements under test after at most three attempts. Notably, some generated tests revealed previously unknown defects in the APIs, including integration issues between endpoints. Statistical analysis identified API complexity and the level of detail in business requirements as the primary factors influencing success rates, with the level of detail in API documentation also affecting outcomes. Feedback from industry practitioners confirmed strong interest in adopting the tool, noting that it substantially reduces the manual effort of writing acceptance tests and improves the alignment between tests and business requirements.
Subjects:
Software Engineering (cs.SE)
Cite as: arXiv:2604.02039 [cs.SE]
(or arXiv:2604.02039v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2604.02039
arXiv-issued DOI via DataCite (pending registration)
Journal reference: 7th ACM/IEEE International Conference on Automation of Software Test (AST 2026)
Related DOI: https://doi.org/10.1145/3793654.3793743
Submission history
From: Bruno Lima [v1] Thu, 2 Apr 2026 13:43:56 UTC (470 KB)