
Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction

arXiv · March 31, 2026 · 10 min read

Alexander Tyurin


Abstract: We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{s}$ and $\tau_{w}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2}$, which improves with the number of workers $n$, where $\Delta = f(x^0) - f^*$ and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term. However, once we account for the communication from the server to the workers $\tau_{s}$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_{s} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2}$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution. To establish this result, we construct a new "worst-case" function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous assumption.
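To make "unbiased random sparsification compressor" concrete, here is a minimal sketch of Rand-K, a standard example of this class. The abstract does not name a specific compressor, so the choice of Rand-K, the function names `rand_k` and `empirical_mean`, and the test vector are all assumptions for illustration. Rand-K keeps $K$ of the $d$ coordinates, chosen uniformly at random, and rescales them by $d/K$ so that the compressed vector equals the input in expectation.

```python
import random


def rand_k(x, k, rng):
    """Rand-K sparsifier (illustrative): keep k uniformly chosen
    coordinates of x, scale them by d/k, and zero out the rest."""
    d = len(x)
    kept = rng.sample(range(d), k)  # k distinct indices, uniform without replacement
    scale = d / k                   # rescaling that makes the compressor unbiased
    out = [0.0] * d
    for i in kept:
        out[i] = scale * x[i]
    return out


def empirical_mean(x, k, trials, seed=0):
    """Average many independent compressions of x to check E[C(x)] = x empirically."""
    rng = random.Random(seed)
    d = len(x)
    acc = [0.0] * d
    for _ in range(trials):
        c = rand_k(x, k, rng)
        for i in range(d):
            acc[i] += c[i]
    return [a / trials for a in acc]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0, 4.0]
    print(rand_k(x, 2, random.Random(1)))  # two nonzero entries, each equal to 2 * x[i]
    print(empirical_mean(x, 2, 20000))     # close to x, up to sampling noise
```

Only $K$ coordinates are transmitted per message instead of $d$, which is how sparsification reduces the per-round communication time. The price is extra variance, $\mathbb{E}\|C(x) - x\|^2 = (d/K - 1)\|x\|^2$ for Rand-K, and the abstract's lower bound concerns how this trade-off limits scaling in $n$ once server-to-worker communication is taken into account.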

Subjects: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Cite as: arXiv:2506.23836 [math.OC]

(or arXiv:2506.23836v2 [math.OC] for this version)

https://doi.org/10.48550/arXiv.2506.23836

arXiv-issued DOI via DataCite

Submission history

From: Alexander Tyurin
[v1] Mon, 30 Jun 2025 13:27:39 UTC (233 KB)
[v2] Sat, 28 Mar 2026 13:27:51 UTC (261 KB)
