
trunk/947e68e925a7fa55b337df5b1de1196bdbd7f470: Implement missing methods in `ProcessGroupWrapper` (#178779)

PyTorch Releases · by Flamefire · March 31, 2026


Most importantly, `shutdown` is missing, which in the case of the NCCL process group may lead to hangs on termination.

See #178758
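For context, `ProcessGroupWrapper` (enabled via `TORCH_DISTRIBUTED_DEBUG=DETAIL`) wraps the real backend and forwards collective calls to it. The failure mode can be sketched with a toy model; the `Backend` and `Wrapper` classes below are illustrative stand-ins, not the real C++ types. If the wrapper does not override `shutdown`, the call never reaches the wrapped backend, which for NCCL means cleanup never runs:

```python
# Toy stand-ins for the real C++ classes; names are illustrative only.
class Backend:
    """Plays the role of the underlying process group (e.g. NCCL)."""

    def __init__(self):
        self.shut_down = False

    def shutdown(self):
        # The real backend must run its own shutdown to terminate cleanly.
        self.shut_down = True


class Wrapper(Backend):
    """Plays the role of ProcessGroupWrapper: delegates to a wrapped backend.

    Before the fix, shutdown() was not overridden, so the inherited base
    implementation ran and the wrapped backend was never shut down.
    """

    def __init__(self, wrapped):
        super().__init__()
        self.wrapped = wrapped

    # The fix: explicitly forward shutdown to the wrapped process group.
    def shutdown(self):
        self.wrapped.shutdown()


backend = Backend()
pg = Wrapper(backend)
pg.shutdown()
print(backend.shut_down)  # → True (without the forwarding override: False)
```

Without the forwarding override, `pg.shutdown()` would only mark the wrapper itself as shut down, leaving the wrapped backend alive, which mirrors the hang described above.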

Example reproducer:

```python
import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 2

def init_distributed(port, rank, world_size, init_file):
    try:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = port
        os.environ["LOCAL_RANK"] = str(rank)
        os.environ["RANK"] = str(rank)
        os.environ["LOCAL_SIZE"] = str(world_size)
        os.environ["WORLD_SIZE"] = str(world_size)

        dist.init_process_group(
            backend="nccl",
            init_method=f"file://{init_file}",
            rank=rank,
            world_size=world_size,
            device_id=torch.device('cuda', rank)
        )
        dist.barrier()
        print(f"[Rank {rank}] ready")
        dist.destroy_process_group()
    except Exception as e:
        print(rank, "ERROR", e)

def test_main():
    mp.set_start_method("forkserver", force=True)
    pool = mp.Pool(processes=WORLD_SIZE)
    results = pool.starmap_async(
        init_distributed,
        [('51200', rank, WORLD_SIZE, '/tmp/ptinit.file') for rank in range(WORLD_SIZE)],
    )
    results.wait()
    pool.close()
    pool.join()
    print("Finished")

if __name__ == "__main__":
    test_main()
```

Note that this reproducer also surfaces a related issue: the device passed to `dist.init_process_group` is not forwarded from `ProcessGroupWrapper` to the underlying process group. The backend therefore falls back to guessing the device and warns:

Guessing device ID based on global rank. This can cause a hang if rank to GPU mapping is heterogeneous. You can specify device_id in init_process_group()

I don't see an obvious solution to that, so I left it for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/178779
Approved by: https://github.com/Skylion007
