CrossTrace: A Cross-Domain Dataset of Grounded Scientific Reasoning Traces for Hypothesis Generation
arXiv:2603.28924v1 Announce Type: new
Abstract: Scientific hypothesis generation is a critical bottleneck in accelerating research, yet existing datasets for training and evaluating hypothesis-generating models are limited to single domains and lack explicit reasoning traces connecting prior knowledge to novel contributions. I introduce CrossTrace, a dataset of 1,389 grounded scientific reasoning traces spanning biomedical research (518), AI/ML (605), and cross-domain work (266). Each trace captures the structured reasoning chain from established knowledge through intermediate logical steps to a novel hypothesis, with every step grounded in source paper text. I define an Input/Trace/Output schema that extends the Bit-Flip-Spark framework of HypoGen with step-level verification, a taxonomy of eight discovery patterns, and multi-domain coverage. Fine-tuning Qwen2.5-7B-Instruct on CrossTrace via QLoRA yields substantial improvements over the untuned baseline: IAScore rises from 0.828 to 0.968 (GPT-4o judge) and from 0.716 to 0.888 (Claude Opus 4.5), structural compliance improves from 0% to 100%, and spark cosine similarity increases from 0.221 to 0.620. Balanced cross-domain training (biomedical + AI/ML + CS) outperforms single-domain training, providing evidence that scientific reasoning patterns transfer across disciplines. Human validation of 150 stratified records confirms 99.7% step-level grounding accuracy and a 0.0% fabrication rate. To my knowledge, CrossTrace is the first large-scale, cross-domain dataset with step-level grounded reasoning traces for hypothesis generation, and my results demonstrate that such traces are an effective training signal whose benefits are at least partially domain-general.
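As an illustration only, the Input/Trace/Output structure and the idea of step-level grounding described in the abstract might be sketched as below. Every class name, field name, and the substring-based grounding check are assumptions made for this example; the abstract does not specify the released schema or how grounding was verified. Only the three domain counts are taken from the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceStep:
    """One intermediate reasoning step (hypothetical field names)."""
    claim: str     # the statement made at this step
    evidence: str  # quote from the source paper intended to ground the claim


@dataclass
class TraceRecord:
    """Hypothetical Input/Trace/Output record, not the released schema."""
    domain: str                                           # e.g. "biomedical"
    input_knowledge: str                                  # Input: established knowledge
    steps: List[TraceStep] = field(default_factory=list)  # Trace: reasoning chain
    hypothesis: str = ""                                  # Output: novel hypothesis


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def grounding_rate(record: TraceRecord, source_text: str) -> float:
    """Fraction of steps whose evidence quote appears in the source text.

    Assumption: grounding is approximated by a normalized substring match.
    Steps with empty evidence count as ungrounded.
    """
    if not record.steps:
        return 0.0
    src = normalize(source_text)
    grounded = 0
    for step in record.steps:
        quote = normalize(step.evidence)
        if quote and quote in src:
            grounded += 1
    return grounded / len(record.steps)


# Domain counts as stated in the abstract; the dictionary keys are hypothetical.
DOMAIN_COUNTS = {"biomedical": 518, "ai_ml": 605, "cross_domain": 266}
assert sum(DOMAIN_COUNTS.values()) == 1389  # matches the reported total
```

A record with two steps, only one of which quotes text that actually occurs in the source, would score 0.5 under this sketch. A real verification pipeline would likely need fuzzier matching than a substring test, since paraphrased or reflowed quotes would be missed.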
Comments: 14 pages, 1 figure, 8 tables. Dataset and code available at this https URL
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2603.28924 [cs.CL]
(or arXiv:2603.28924v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.28924
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Andrew Bouras
[v1] Mon, 30 Mar 2026 18:55:55 UTC (115 KB)