
trunk/947e68e925a7fa55b337df5b1de1196bdbd7f470: Implement missing methods in `ProcessGroupWrapper` (#178779)

PyTorch Releases · by Flamefire · March 31, 2026


Most importantly, `shutdown` is missing, which in the case of the NCCL process group may lead to hangs on termination.

See #178758
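For context, `ProcessGroupWrapper` (enabled via `TORCH_DISTRIBUTED_DEBUG=DETAIL`) wraps the real backend and forwards collective calls to it. The failure mode can be sketched with a toy model; the `Backend` and `Wrapper` classes below are illustrative stand-ins, not the real C++ types. If the wrapper does not override `shutdown`, the call never reaches the wrapped backend, which for NCCL means cleanup never runs:

```python
# Toy stand-ins for the real C++ classes; names are illustrative only.
class Backend:
    """Plays the role of the underlying process group (e.g. NCCL)."""

    def __init__(self):
        self.shut_down = False

    def shutdown(self):
        # The real backend must run its own shutdown to terminate cleanly.
        self.shut_down = True


class Wrapper(Backend):
    """Plays the role of ProcessGroupWrapper: delegates to a wrapped backend.

    Before the fix, shutdown() was not overridden, so the inherited base
    implementation ran and the wrapped backend was never shut down.
    """

    def __init__(self, wrapped):
        super().__init__()
        self.wrapped = wrapped

    # The fix: explicitly forward shutdown to the wrapped process group.
    def shutdown(self):
        self.wrapped.shutdown()


backend = Backend()
pg = Wrapper(backend)
pg.shutdown()
print(backend.shut_down)  # → True (without the forwarding override: False)
```

Without the forwarding override, `pg.shutdown()` would only mark the wrapper itself as shut down, leaving the wrapped backend alive, which mirrors the hang described above.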

Example reproducer:

```python
import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2

def init_distributed(port, rank, world_size, init_file):
    try:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = port
        os.environ["LOCAL_RANK"] = str(rank)
        os.environ["RANK"] = str(rank)
        os.environ["LOCAL_SIZE"] = str(world_size)
        os.environ["WORLD_SIZE"] = str(world_size)

        dist.init_process_group(
            backend="nccl",
            init_method=f"file://{init_file}",
            rank=rank,
            world_size=world_size,
            device_id=torch.device('cuda', rank)
        )
        dist.barrier()
        print(f"[Rank {rank}] ready")
        dist.destroy_process_group()
    except Exception as e:
        print(rank, "ERROR", e)

def test_main():
    mp.set_start_method("forkserver", force=True)
    pool = mp.Pool(processes=WORLD_SIZE)
    results = pool.starmap_async(
        init_distributed,
        [('51200', rank, WORLD_SIZE, '/tmp/ptinit.file') for rank in range(WORLD_SIZE)],
    )
    results.wait()
    pool.close()
    pool.join()
    print("Finished")

if __name__ == "__main__":
    test_main()
```

Note that this reproducer also surfaces a related issue: the device passed to `dist.init_process_group` is not forwarded from `ProcessGroupWrapper` to the underlying process group. The backend therefore falls back to guessing the device and warns:

Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()

I don't see an obvious solution to that, so I left it for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/178779
Approved by: https://github.com/Skylion007
