trunk/947e68e925a7fa55b337df5b1de1196bdbd7f470: Implement missing methods in `ProcessGroupWrapper` (#178779)
Most importantly, shutdown is missing, which in the case of the NCCL process group may lead to hangs on termination.
See #178758
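As a rough illustration of the missing piece (with hypothetical class names, not PyTorch's actual API), the fix amounts to having the wrapper forward lifecycle calls such as shutdown to the group it wraps:

```python
class _NcclLikePG:
    """Stand-in for a wrapped backend process group (hypothetical)."""
    def __init__(self):
        self.shut_down = False

    def shutdown(self):
        self.shut_down = True


class _ProcessGroupWrapperSketch:
    """Minimal sketch: forward lifecycle methods to the wrapped group.
    Names are illustrative, not PyTorch's real ProcessGroupWrapper."""
    def __init__(self, wrapped_pg):
        self._pg = wrapped_pg

    def shutdown(self):
        # Without this delegation, the wrapped (e.g. NCCL) group is never
        # shut down and process termination can hang on its resources.
        self._pg.shutdown()


inner = _NcclLikePG()
wrapper = _ProcessGroupWrapperSketch(inner)
wrapper.shutdown()  # reaches the inner group only because it is forwarded
```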
Example reproducer:
import os

os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2

def init_distributed(port, rank, world_size, init_file):
    try:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = port
        os.environ["LOCAL_RANK"] = str(rank)
        os.environ["RANK"] = str(rank)
        os.environ["LOCAL_SIZE"] = str(world_size)
        os.environ["WORLD_SIZE"] = str(world_size)
        dist.init_process_group(
            backend="nccl",
            init_method=f"file://{init_file}",
            rank=rank,
            world_size=world_size,
            device_id=torch.device("cuda", rank),
        )
        dist.barrier()
        print(f"[Rank {rank}] ready")
        dist.destroy_process_group()
    except Exception as e:
        print(rank, "ERROR", e)

def test_main():
    mp.set_start_method("forkserver", force=True)
    pool = mp.Pool(processes=WORLD_SIZE)
    results = pool.starmap_async(
        init_distributed,
        [("51200", rank, WORLD_SIZE, "/tmp/ptinit.file") for rank in range(WORLD_SIZE)],
    )
    results.wait()
    pool.close()
    pool.join()
    print("Finished")

if __name__ == "__main__":
    test_main()
Note that this also shows a related issue: the device passed to dist.init_process_group is not forwarded from ProcessGroupWrapper to the underlying process group. Hence the underlying group will try to guess the device and warn about it:
Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()
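The forwarding gap can be illustrated with a toy wrapper (all names hypothetical, not PyTorch's real classes): the device binding stops at the wrapper and never reaches the inner group, which then has to fall back to guessing by global rank:

```python
class _InnerPG:
    """Stand-in for the wrapped (e.g. NCCL) process group."""
    def __init__(self):
        self.bound_device_id = None  # never set through the wrapper


class _DebugWrapper:
    """Stand-in for ProcessGroupWrapper; names are illustrative only."""
    def __init__(self, inner):
        self._inner = inner
        self.bound_device_id = None

    def bind_device(self, device):
        # The device lands on the wrapper only; the inner group never
        # receives it, so it must guess a device from the global rank.
        self.bound_device_id = device


inner = _InnerPG()
pg = _DebugWrapper(inner)
pg.bind_device("cuda:0")
# pg.bound_device_id is set, but inner.bound_device_id is still None
```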
I don't see an obvious solution to that, so it is left for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/178779
Approved by: https://github.com/Skylion007
PyTorch Releases
https://github.com/pytorch/pytorch/releases/tag/trunk%2F947e68e925a7fa55b337df5b1de1196bdbd7f470
