
DS-STAR: A state-of-the-art versatile data science agent

Google Research Blog · November 6, 2025

Data Mining & Modeling

Data science is a field dedicated to transforming raw data into meaningful, actionable insights, playing an essential role in solving real-world challenges. Businesses often depend on data-driven insights to make pivotal strategic decisions. However, the data science process is frequently complex, demanding a high level of expertise in fields like computer science and statistics. This workflow consists of many time-intensive activities, from interpreting various documents to performing complex data processing and statistical analysis.

To streamline this complex workflow, recent research has focused on using off-the-shelf large language models (LLMs) to create autonomous data science agents. The goal of these agents is to convert natural language questions into executable code for a desired task. Despite significant progress, current data science agents have several limitations that hinder their practical use. A major issue is their heavy reliance on well-structured data, such as CSV files or relational database tables. This narrow focus ignores the valuable information contained in diverse, heterogeneous data formats, such as JSON, unstructured text, and markdown files, that are common in real-world applications. Another challenge is that many data science problems are open-ended and lack ground-truth labels, making it difficult to verify whether an agent's reasoning is correct.

To that end, we present DS-STAR, a new agent designed to solve data science problems. DS-STAR introduces three key innovations: (1) a data file analysis module that automatically extracts context from varied data formats, including unstructured ones; (2) a verification stage where an LLM-based judge assesses the plan’s sufficiency at each step; and (3) a sequential planning process that iteratively refines the initial plan based on feedback. This iterative refinement allows DS-STAR to handle complex analyses that draw verifiable insights from multiple data sources. We demonstrate that DS-STAR achieves state-of-the-art performance on challenging benchmarks like DABStep, KramaBench, and DA-Code. It especially excels with tasks involving diverse, heterogeneous data files.

DS-STAR

The DS-STAR framework operates in two main stages. First, it automatically examines all files in a directory and creates a textual summary of their structure and contents. This summary becomes a vital source of context for tackling the task at hand.
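As a rough illustration of this first stage, the sketch below walks a directory and builds a plain-text summary of each file. It is a simplified stand-in, not DS-STAR's analyzer: it relies on fixed heuristics rather than an LLM, and the function names `describe_file` and `summarize_directory` are hypothetical.

```python
import csv
import json
from pathlib import Path


def describe_file(path: Path, max_chars: int = 200) -> str:
    """Return a short textual description of one data file (heuristic only)."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        header = rows[0] if rows else []
        return f"{path.name}: CSV, {max(len(rows) - 1, 0)} data rows, columns={header}"
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(data, dict):
            return f"{path.name}: JSON object, keys={sorted(data)}"
        if isinstance(data, list):
            return f"{path.name}: JSON array, {len(data)} items"
        return f"{path.name}: JSON scalar of type {type(data).__name__}"
    # Fallback for markdown and other unstructured text: show the beginning.
    text = path.read_text(encoding="utf-8", errors="replace")
    preview = " ".join(text[:max_chars].split())
    return f"{path.name}: text, {len(text)} chars, starts with: {preview!r}"


def summarize_directory(root: str) -> str:
    """Concatenate per-file descriptions into one context string."""
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    return "\n".join(describe_file(p) for p in files)
```

The resulting string could be prepended to the prompts of downstream agents so that planning and coding start from the actual structure of the data.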

Second, DS-STAR engages in a primary loop of planning, implementing, and verifying. The Planner agent first creates a high-level plan, which the Coder agent then transforms into a code script. The Verifier agent, an LLM-based judge, then evaluates whether the current plan and its code are adequate to solve the problem. If the judge finds the plan insufficient, DS-STAR refines it by altering or adding steps (as determined by the Router agent) and then repeats the cycle. Importantly, DS-STAR mimics how an expert analyst uses tools like Google Colab to build a plan sequentially, reviewing intermediate results before proceeding. This iterative cycle continues until a plan is deemed satisfactory or the maximum number of rounds (10) is reached, at which point the final code is delivered as the solution.
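The control flow of this loop can be sketched as follows. This is a minimal outline under stated assumptions, not the DS-STAR implementation: the four callables are hypothetical stand-ins for the Planner, Coder, Verifier, and the Router-guided refinement, each of which would be an LLM call in practice. Only the cap of 10 rounds comes from the description above.

```python
from typing import Callable, List, Tuple

Plan = List[str]


def run_plan_loop(
    make_initial_plan: Callable[[], Plan],       # stand-in for the Planner
    implement: Callable[[Plan], str],            # stand-in for the Coder
    is_sufficient: Callable[[Plan, str], bool],  # stand-in for the Verifier
    refine_plan: Callable[[Plan, str], Plan],    # stand-in for Router + Planner
    max_rounds: int = 10,
) -> Tuple[Plan, str, int]:
    """Plan, implement, verify, and refine until sufficient or out of rounds."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    plan = make_initial_plan()
    code = implement(plan)
    rounds = 1
    while not is_sufficient(plan, code) and rounds < max_rounds:
        plan = refine_plan(plan, code)
        code = implement(plan)
        rounds += 1
    # The latest code is returned whether or not the verifier was satisfied.
    return plan, code, rounds


# Toy usage: the "verifier" accepts any plan with at least three steps.
plan, code, rounds = run_plan_loop(
    make_initial_plan=lambda: ["load data"],
    implement=lambda p: f"script with {len(p)} steps",
    is_sufficient=lambda p, c: len(p) >= 3,
    refine_plan=lambda p, c: p + [f"step {len(p) + 1}"],
)
```

Passing the agents in as callables keeps the loop itself independent of any particular model or prompt.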

Evaluation

To evaluate DS-STAR’s effectiveness, we compared its performance to existing state-of-the-art methods (AutoGen, DA-Agent) on three well-regarded data science benchmarks: DABStep, KramaBench, and DA-Code. These benchmarks evaluate performance on complex tasks such as data wrangling, machine learning, and visualization that use multiple data sources and formats.

The results show that DS-STAR substantially outperforms AutoGen and DA-Agent in all test scenarios. Compared to the best alternative, DS-STAR raised the accuracy from 41.0% to 45.2% on DABStep, 39.8% to 44.7% on KramaBench, and 37.0% to 38.5% on DA-Code. Notably, DS-STAR also secured the top rank on the public leaderboard for the DABStep benchmark (as of 9/18/2025). On both easy tasks (where the answer is in a single file) and hard tasks (requiring multiple files), DS-STAR consistently surpasses competing baselines, demonstrating its superior ability to work with multiple, heterogeneous data sources.
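To put these figures side by side, the short snippet below computes the absolute gain (in percentage points) and the relative gain for each benchmark. It uses only the accuracy values quoted above; the helper name `gains` is hypothetical.

```python
# (best alternative, DS-STAR) accuracy in percent, as reported in the text.
results = {
    "DABStep": (41.0, 45.2),
    "KramaBench": (39.8, 44.7),
    "DA-Code": (37.0, 38.5),
}


def gains(best_alternative: float, ds_star: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    diff = ds_star - best_alternative
    return round(diff, 1), round(100 * diff / best_alternative, 1)


for name, (baseline, ours) in results.items():
    absolute, relative = gains(baseline, ours)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

This works out to gains of 4.2, 4.9, and 1.5 percentage points on DABStep, KramaBench, and DA-Code respectively, so the margin on DA-Code is noticeably smaller than on the other two benchmarks.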

In-depth analysis of DS-STAR

Next, we conducted ablation studies to verify the effectiveness of DS-STAR’s individual components and analyze the impact of the number of refinement rounds, specifically by measuring the iterations required to generate a sufficient plan.

Data File Analyzer: This agent is essential for high performance. Without the descriptions it generates (Variant 1), DS-STAR's accuracy on difficult tasks within the DABStep benchmark sharply dropped to 26.98%, underscoring the importance of rich data context for effective planning and implementation.

Router: The Router agent’s ability to decide whether to add a new step or fix an incorrect one is vital. When we removed it (Variant 2), DS-STAR could only add new steps sequentially, leading to worse performance on both easy and hard tasks. This demonstrates that it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps.
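The difference between the two behaviors can be illustrated with a small sketch. The `RouterDecision` type and `apply_decision` function are hypothetical, and the choice to discard every step after the faulty one is an assumption made here for illustration; the text above only states that the Router chooses between adding a step and fixing an incorrect one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RouterDecision:
    # 0-based index of the step judged incorrect; None means "add a new step".
    wrong_step: Optional[int] = None


def apply_decision(plan: List[str], decision: RouterDecision, new_step: str) -> List[str]:
    """Return a refined copy of the plan according to the router's decision."""
    if decision.wrong_step is None:
        # Add-only behavior: what remains when the Router is ablated.
        return plan + [new_step]
    if not 0 <= decision.wrong_step < len(plan):
        raise IndexError("wrong_step is outside the plan")
    # Assumption: steps after the faulty one depended on it, so they are dropped.
    return plan[: decision.wrong_step] + [new_step]


plan = ["load payments.csv", "filter by year", "average the wrong column"]
extended = apply_decision(plan, RouterDecision(), "plot the result")
repaired = apply_decision(plan, RouterDecision(wrong_step=2), "average the fee column")
```

In the add-only case the flawed third step stays in the plan, while the repair case replaces it, which mirrors why correcting a step can be preferable to appending more steps on top of a mistake.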

Generalizability Across LLMs: We also tested DS-STAR's adaptability by using GPT-5 as the base model. This yielded promising results on the DABStep benchmark, indicating the framework's generalizability. Interestingly, DS-STAR with GPT-5 performed better on easy tasks, while the Gemini-2.5-Pro version performed better on hard tasks.

An analysis of the refinement process: The figure below shows that difficult tasks naturally require more iterations. On the DABStep benchmark, hard tasks needed an average of 5.6 rounds to solve, whereas easy tasks required only 3.0 rounds. Furthermore, over half of the easy tasks were completed in just a single round.

Conclusion

In this work, we introduced DS-STAR, a new agent that can autonomously solve data science problems. The framework is defined by two core innovations: the automatic analysis of diverse file formats and an iterative, sequential planning process that uses a novel LLM-based verification system. DS-STAR establishes a new state of the art on the DABStep, KramaBench, and DA-Code benchmarks, outperforming the best alternative on each. By automating complex data science tasks, DS-STAR has the potential to make data science more accessible for individuals and organizations, helping to drive innovation across many different fields.

Acknowledgements

We would like to thank Jiefeng Chen, Jinwoo Shin, Raj Sinha, Mihir Parmar, George Lee, Vishy Tirumalashetty, Tomas Pfister and Burak Gokturk for their valuable contributions to this work.
