On Neural Scaling Laws for Weather Emulation through Continual Training
Neural scaling laws, which in some domains can predict the performance of large neural networks as a function of model, data, and compute scale, are the cornerstone of building foundation models in Natural Language Processing and Computer Vision. We study neural scaling in Scientific Machine Learning, focusing on models for weather forecasting. To analyze scaling behavior in as simple a setting as possible, we adopt a minimal, scalable, general-purpose Swin Transformer architecture, and we use continual training with constant learning rates and periodic cooldowns as an efficient training strat — Shashank Subramanian, Alexander Kiefer, Arnur Nigmetov
View PDF HTML (experimental)
Abstract:Neural scaling laws, which in some domains can predict the performance of large neural networks as a function of model, data, and compute scale, are the cornerstone of building foundation models in Natural Language Processing and Computer Vision. We study neural scaling in Scientific Machine Learning, focusing on models for weather forecasting. To analyze scaling behavior in as simple a setting as possible, we adopt a minimal, scalable, general-purpose Swin Transformer architecture, and we use continual training with constant learning rates and periodic cooldowns as an efficient training strategy. We show that models trained in this minimalist way follow predictable scaling trends and even outperform standard cosine learning rate schedules. Cooldown phases can be re-purposed to improve downstream performance, e.g., enabling accurate multi-step rollouts over longer forecast horizons as well as sharper predictions through spectral loss adjustments. We also systematically explore a wide range of model and dataset sizes under various compute budgets to construct IsoFLOP curves, and we identify compute-optimal training regimes. Extrapolating these trends to larger scales highlights potential performance limits, demonstrating that neural scaling can serve as an important diagnostic for efficient resource allocation. We open-source our code for reproducibility.
Comments: ICLR Foundation Models for Science Workshop 2026, 19 pages, 13 figures
Subjects:
Machine Learning (cs.LG)
Cite as: arXiv:2603.25687 [cs.LG]
(or arXiv:2603.25687v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2603.25687
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Shashank Subramanian [view email] [v1] Thu, 26 Mar 2026 17:37:25 UTC (4,967 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxiv
Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics
arXiv:2604.00443v1 Announce Type: new Abstract: If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition--the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due a lexical confound: neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in <=1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense di

The Rashomon Effect for Visualizing High-Dimensional Data
arXiv:2604.00485v1 Announce Type: new Abstract: Dimension reduction (DR) is inherently non-unique: multiple embeddings can preserve the structure of high-dimensional data equally well while differing in layout or geometry. In this paper, we formally define the Rashomon set for DR -- the collection of `good' embedding -- and show how embracing this multiplicity leads to more powerful and trustworthy representations. Specifically, we pursue three goals. First, we introduce PCA-informed alignment to steer embeddings toward principal components, making axes interpretable without distorting local neighborhoods. Second, we design concept-alignment regularization that aligns an embedding dimension with external knowledge, such as class labels or user-defined concepts. Third, we propose a method t

Neuropsychiatric Deviations From Normative Profiles: An MRI-Derived Marker for Early Alzheimer's Disease Detection
arXiv:2604.00545v1 Announce Type: new Abstract: Neuropsychiatric symptoms (NPS) such as depression and apathy are common in Alzheimer's disease (AD) and often precede cognitive decline. NPS assessments hold promise as early detection markers due to their correlation with disease progression and their non-invasive nature. Yet current tools cannot distinguish whether NPS are part of aging or early signs of AD, limiting their utility. We present a deep learning-based normative modelling framework to identify atypical NPS burden from structural MRI. A 3D convolutional neural network was trained on cognitively stable participants from the Alzheimer's Disease Neuroimaging Initiative, learning the mapping between brain anatomy and Neuropsychiatric Inventory Questionnaire (NPIQ) scores. Deviations
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers

GUIDE: Reinforcement Learning for Behavioral Action Support in Type 1 Diabetes
arXiv:2604.00385v1 Announce Type: new Abstract: Type 1 Diabetes (T1D) management requires continuous adjustment of insulin and lifestyle behaviors to maintain blood glucose within a safe target range. Although automated insulin delivery (AID) systems have improved glycemic outcomes, many patients still fail to achieve recommended clinical targets, warranting new approaches to improve glucose control in patients with T1D. While reinforcement learning (RL) has been utilized as a promising approach, current RL-based methods focus primarily on insulin-only treatment and do not provide behavioral recommendations for glucose control. To address this gap, we propose GUIDE, an RL-based decision-support framework designed to complement AID technologies by providing behavioral recommendations to pre

Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models
arXiv:2604.00890v1 Announce Type: new Abstract: Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have either taken a neural, symbolic or neuro-symbolic approach. But this solves only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in hitherto existing models, this paper

Google research suggests encryption technique used by Bitcoin will be cracked by quantum computers around 2029 — search giant says quantum attacks need to be prepared for now
Google research suggests encryption technique used by Bitcoin will be cracked by quantum computers around 2029 — search giant says quantum attacks need to be prepared for now

ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction
arXiv:2604.00494v1 Announce Type: new Abstract: Auto-regressive frameworks for next-scale prediction of 2D images have demonstrated strong potential for producing diverse and sophisticated content by progressively refining a coarse input. However, extending this paradigm to 3D object generation remains largely unexplored. In this paper, we introduce auto-regressive Gaussian splatting (ARGS), a framework for making next-scale predictions in parallel for generation according to levels of detail. We propose a Gaussian simplification strategy and reverse the simplification to guide next-scale generation. Benefiting from the use of hierarchical trees, the generation process requires only \(\mathcal{O}(\log n)\) steps, where \(n\) is the number of points. Furthermore, we propose a tree-based tra
Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!