
Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models

ArXiv CS.AI | Submitted on 1 Apr 2026 | April 2, 2026


Abstract: Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In the existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have taken a neural, symbolic or neuro-symbolic approach. But this addresses only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in existing models, this paper proposes MARS-GPS, which generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. Empirical results show that MARS-GPS with 8 parallel rollouts achieves 88.8% on Geometry3K, an improvement of nearly 11% over the prior state-of-the-art, with accuracy scaling consistently as the number of rollouts increases from 1 to 16 (+6.0% on the ablation subset). We provide our code and data in an anonymous repository: this https URL.
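The ranking and voting steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring or aggregation rule, so everything here is an illustrative assumption and not the MARS-GPS implementation: the function names, the mean-entropy score, the 1 / (1 + H) vote weight, and the `keep` parameter are all hypothetical. Each rollout is assumed to expose its final answer and the probability distribution the model produced at each generated token.

```python
import math
from collections import defaultdict


def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) over a rollout's per-token distributions.

    Lower values are treated as higher confidence (an assumption of this sketch).
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_dists
    ]
    return sum(entropies) / len(entropies)


def entropy_weighted_vote(rollouts, keep=None):
    """Pick an answer from (answer, token_dists) pairs.

    Rollouts are ranked by mean token entropy, optionally truncated to the
    `keep` most confident ones, and each surviving rollout casts a vote
    weighted by 1 / (1 + entropy). The weighting is hypothetical.
    """
    scored = sorted(
        ((mean_token_entropy(dists), answer) for answer, dists in rollouts),
        key=lambda pair: pair[0],
    )
    if keep is not None:
        scored = scored[:keep]

    votes = defaultdict(float)
    for entropy, answer in scored:
        votes[answer] += 1.0 / (1.0 + entropy)
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # Toy data: three rollouts, two of which agree on "12.5".
    rollouts = [
        ("12.5", [[0.9, 0.1], [0.8, 0.2]]),
        ("12.5", [[0.7, 0.3], [0.6, 0.4]]),
        ("10.0", [[0.5, 0.5], [0.5, 0.5]]),
    ]
    print(entropy_weighted_vote(rollouts))  # 12.5
```

Because votes are weighted by confidence, a single low-entropy rollout can outvote several high-entropy ones. The abstract also mentions Python code execution and a self-verification stage, which this sketch leaves out.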

Comments: Under review, 4 figures, 7 tables

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2604.00890 [cs.AI]

(or arXiv:2604.00890v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.00890

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Syed Rifat Raiyan
[v1] Wed, 1 Apr 2026 13:37:06 UTC (144 KB)

Original source

ArXiv CS.AI

https://arxiv.org/abs/2604.00890



