DS-STAR: A state-of-the-art versatile data science agent

Google Research Blog, November 6, 2025

Data Mining & Modeling

Data science is a field dedicated to transforming raw data into meaningful, actionable insights, playing an essential role in solving real-world challenges. Businesses often depend on data-driven insights to make pivotal strategic decisions. However, the data science process is frequently complex, demanding a high level of expertise in fields like computer science and statistics. This workflow consists of many time-intensive activities, from interpreting various documents to performing complex data processing and statistical analysis.

To streamline this complex workflow, recent research has focused on using off-the-shelf large language models (LLMs) to create autonomous data science agents. The goal of these agents is to convert natural language questions into executable code for a desired task. Despite significant progress, current data science agents have several limitations that hinder their practical use. A major issue is their heavy reliance on well-structured data, such as CSV files and relational databases. This narrow focus ignores the valuable information contained in diverse, heterogeneous data formats, such as JSON, unstructured text, and markdown files, that are common in real-world applications. Another challenge is that many data science problems are open-ended and lack ground-truth labels, making it difficult to verify whether an agent's reasoning is correct.

To that end, we present DS-STAR, a new agent designed to solve data science problems. DS-STAR introduces three key innovations: (1) a data file analysis module that automatically extracts context from varied data formats, including unstructured ones; (2) a verification stage where an LLM-based judge assesses the plan’s sufficiency at each step; and (3) a sequential planning process that iteratively refines the initial plan based on feedback. This iterative refinement allows DS-STAR to handle complex analyses that draw verifiable insights from multiple data sources. We demonstrate that DS-STAR achieves state-of-the-art performance on challenging benchmarks like DABStep, KramaBench, and DA-Code. It especially excels with tasks involving diverse, heterogeneous data files.

DS-STAR

The DS-STAR framework operates in two main stages. First, it automatically examines all files in a directory and creates a textual summary of their structure and contents. This summary becomes a vital source of context for tackling the task at hand.
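The post does not say how the summary is produced, so the following is only a hand-written stand-in for the idea: walk a directory and emit one descriptive line per file. The function names (`describe_file`, `summarize_directory`) and the choice of what to report for each format are assumptions made for illustration, not DS-STAR's actual analyzer.

```python
import csv
import json
from pathlib import Path


def describe_file(path: Path, max_chars: int = 200) -> str:
    """Return a one-line textual description of a single data file."""
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        header = rows[0] if rows else []
        return f"{path.name}: CSV, {max(len(rows) - 1, 0)} data rows, columns={header}"
    if suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(data, dict):
            return f"{path.name}: JSON object, keys={sorted(data)}"
        if isinstance(data, list):
            return f"{path.name}: JSON array, {len(data)} items"
        return f"{path.name}: JSON {type(data).__name__}"
    # Fallback for markdown, plain text, and anything else readable as text.
    text = path.read_text(encoding="utf-8", errors="replace")
    preview = " ".join(text.split())[:max_chars]
    return f"{path.name}: text, {len(text)} chars, preview={preview!r}"


def summarize_directory(directory: str) -> str:
    """Concatenate per-file descriptions into one context string."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return "\n".join(describe_file(p) for p in files)
```

The resulting string could then be passed as context to the later planning and coding steps.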

Second, DS-STAR enters its main loop of planning, implementing, and verifying. The Planner agent first creates a high-level plan, which the Coder agent then turns into a code script. The Verifier agent, an LLM-based judge, then evaluates whether the current plan and its code are sufficient to solve the problem. If the judge finds the plan insufficient, the Router agent decides whether to alter an existing step or add a new one, DS-STAR refines the plan accordingly, and the cycle repeats. Importantly, this mimics how an expert analyst uses tools like Google Colab to build a plan sequentially, reviewing intermediate results before proceeding. The cycle continues until the plan is deemed satisfactory or the maximum number of rounds (10) is reached, at which point the final code is delivered as the solution.
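The control flow of this loop can be sketched roughly as follows. The five callables stand in for the LLM-backed agents and a code executor; their names and signatures are hypothetical. The sketch also assumes that "altering" a step means cutting the plan back to just before the faulty step and re-planning from there, which is one plausible reading rather than something the post states.

```python
from typing import Callable, List, Optional, Tuple

MAX_ROUNDS = 10  # round limit stated in the post


def run_ds_loop(
    question: str,
    context: str,
    planner: Callable[[str, str, List[str], str], str],
    coder: Callable[[str, str, List[str]], str],
    executor: Callable[[str], str],
    verifier: Callable[[str, List[str], str, str], bool],
    router: Callable[[str, List[str], str], Optional[int]],
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[str, List[str], int]:
    """Plan, code, execute, and verify until sufficient or out of rounds."""
    plan: List[str] = []
    code, result = "", ""
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Add one step at a time, informed by the latest execution result.
        plan.append(planner(question, context, plan, result))
        code = coder(question, context, plan)
        result = executor(code)
        if verifier(question, plan, code, result):
            break
        # Router returns None to keep the plan, or the index of a faulty step.
        faulty = router(question, plan, result)
        if faulty is not None:
            plan = plan[:faulty]  # assumption: drop the faulty step and later ones
    return code, plan, rounds
```

Whatever code exists when the loop ends is returned, so a run that hits the round limit still yields a script, matching the post's description.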

Evaluation

To evaluate DS-STAR’s effectiveness, we compared its performance to existing state-of-the-art methods (AutoGen, DA-Agent) on three well-regarded data science benchmarks: DABStep, KramaBench, and DA-Code. These benchmarks evaluate performance on complex tasks like data wrangling, machine learning, and visualization that use multiple data sources and formats.

The results show that DS-STAR substantially outperforms AutoGen and DA-Agent in all test scenarios. Compared to the best alternative, DS-STAR raised the accuracy from 41.0% to 45.2% on DABStep, 39.8% to 44.7% on KramaBench, and 37.0% to 38.5% on DA-Code. Notably, DS-STAR also secured the top rank on the public leaderboard for the DABStep benchmark (as of 9/18/2025). On both easy tasks (where the answer is in a single file) and hard tasks (requiring multiple files), DS-STAR consistently surpasses competing baselines, demonstrating its superior ability to work with multiple, heterogeneous data sources.
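To put these accuracy figures in perspective, the short calculation below derives the absolute gain (in percentage points) and the relative gain over the best alternative for each benchmark. The input numbers are taken from the paragraph above; the helper name `gains` is ours.

```python
# (best alternative accuracy %, DS-STAR accuracy %) as reported in the post
results = {
    "DABStep": (41.0, 45.2),
    "KramaBench": (39.8, 44.7),
    "DA-Code": (37.0, 38.5),
}


def gains(best_baseline: float, ds_star: float) -> tuple:
    """Return (absolute gain in points, relative gain in %), rounded to 0.1."""
    absolute = round(ds_star - best_baseline, 1)
    relative = round(100 * (ds_star - best_baseline) / best_baseline, 1)
    return absolute, relative


for name, (baseline, ds_star) in results.items():
    absolute, relative = gains(baseline, ds_star)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

This works out to gains of 4.2, 4.9, and 1.5 points, or roughly 10.2%, 12.3%, and 4.1% in relative terms, so the margin on DA-Code is noticeably smaller than on the other two benchmarks.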

In-depth analysis of DS-STAR

Next, we conducted ablation studies to verify the effectiveness of DS-STAR’s individual components and analyze the impact of the number of refinement rounds, specifically by measuring the iterations required to generate a sufficient plan.

Data File Analyzer: This agent is essential for high performance. Without the descriptions it generates (Variant 1), DS-STAR's accuracy on difficult tasks within the DABStep benchmark sharply dropped to 26.98%, underscoring the importance of rich data context for effective planning and implementation.

Router: The Router agent’s ability to determine if a new step is needed or to fix an incorrect step is vital. When we removed it (Variant 2), DS-STAR only added new steps sequentially, leading to worse performance on both easy and hard tasks. This demonstrated that it is more effective to correct mistakes in a plan than to keep adding potentially flawed steps.
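The difference between the two variants can be illustrated with a toy plan. This is a sketch under assumptions: the function names, the example step descriptions, and the idea that fixing a step means cutting the plan back to it are all hypothetical, chosen only to show why append-only refinement leaves a wrong step in place.

```python
from typing import List, Optional


def refine_with_router(plan: List[str], faulty_index: Optional[int], new_step: str) -> List[str]:
    """Router variant: cut back to the faulty step (if any), then add the new step."""
    kept = plan if faulty_index is None else plan[:faulty_index]
    return kept + [new_step]


def refine_append_only(plan: List[str], new_step: str) -> List[str]:
    """Ablated variant (no Router): always append, so a wrong step stays in the plan."""
    return plan + [new_step]


# Hypothetical plan in which the step at index 1 filters on the wrong year.
plan = ["load the payments table", "filter to year 2022", "average the fee column"]

print(refine_with_router(plan, 1, "filter to year 2023"))
print(refine_append_only(plan, "filter to year 2023"))
```

With the Router, the faulty filter and the step built on it are discarded before re-planning. Without it, the plan grows to four steps and still contains the incorrect filter.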

Generalizability Across LLMs: We also tested DS-STAR's adaptability by using GPT-5 as the base model. This yielded promising results on the DABStep benchmark, indicating the framework's generalizability. Interestingly, DS-STAR with GPT-5 performed better on easy tasks, while the Gemini-2.5-Pro version performed better on hard tasks.

An analysis of the refinement process: The figure below shows that difficult tasks naturally require more iterations. On the DABStep benchmark, hard tasks needed an average of 5.6 rounds to solve, whereas easy tasks required only 3.0 rounds. Furthermore, over half of the easy tasks were completed in just a single round.

Conclusion

In this work, we introduced DS-STAR, a new agent that can autonomously solve data science problems. The framework is defined by two core innovations: the automatic analysis of diverse file formats and an iterative, sequential planning process that uses a novel LLM-based verification system. DS-STAR establishes a new state-of-the-art on the DABStep, KramaBench, and DA-Code benchmarks, outperforming the best alternative. By automating complex data science tasks, DS-STAR has the potential to make data science more accessible for individuals and organizations, helping to drive innovation across many different fields.

Acknowledgements

We would like to thank Jiefeng Chen, Jinwoo Shin, Raj Sinha, Mihir Parmar, George Lee, Vishy Tirumalashetty, Tomas Pfister and Burak Gokturk for their valuable contributions to this work.

Original source

Google Research Blog

https://research.google/blog/ds-star-a-state-of-the-art-versatile-data-science-agent/
