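I ran into the following two problems while setting up a PyTorch distributed training environment and would appreciate some help. The module is an AGX Orin and the JetPack version is 5.1.1:
1. Is there a complete environment setup guide and usage example for the PyTorch distributed training framework on AGX Orin?
2. I am currently using the officially provided PyTorch package. After installing it, the distributed framework torch.distributed does not support functions such as init_process_group. After upgrading PyTorch with pip3 install --upgrade torch, functions such as init_process_group are supported, but the upgraded PyTorch no longer matches CUDA. Is there a PyTorch package that both supports init_process_group and matches CUDA, or do I need to install additional packages to get init_process_group support?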
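To make the mismatch described above easier to pin down, here is a minimal diagnostic sketch. It only reports what the currently installed PyTorch says about itself: its version, whether it can see CUDA, and whether torch.distributed was compiled in. The helper name describe_torch_build is hypothetical and only used for illustration.

```python
import importlib.util


def describe_torch_build():
    """Report whether the installed PyTorch has CUDA and torch.distributed support."""
    report = {"installed": False, "version": None, "cuda": None, "distributed": None}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment

    import torch
    import torch.distributed as dist

    report["installed"] = True
    report["version"] = torch.__version__
    # False when the wheel was built without CUDA or cannot find a usable GPU
    report["cuda"] = torch.cuda.is_available()
    # False when PyTorch was compiled without distributed support
    report["distributed"] = dist.is_available()
    return report


if __name__ == "__main__":
    print(describe_torch_build())
```

Running this once with each installed package shows which one is missing CUDA and which one is missing distributed support.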
Hi
1. You can find the information in the two documents below:
Or we also have containers with PyTorch pre-installed:
2. By default, our prebuilt package is not built with torch.distributed.
If you want the feature enabled, please build PyTorch from source.
More details can be found in the below link:
@janelin_312 it’s not built with distributed enabled (it’s not a common use-case for Jetson) - you can rebuild PyTorch with USE_DISTRIBUTED=1 (after installing libopenmpi-dev). See the “Build from Source” part of this thread: PyTorch for Jetson
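After rebuilding with USE_DISTRIBUTED=1, a quick single-process smoke test can confirm that init_process_group actually works. The sketch below is not from the thread. It assumes the gloo backend was compiled in (it reports this instead of failing if not), uses a world size of 1 on localhost, and the helper names find_free_port and smoke_test_init_process_group are hypothetical.

```python
import importlib.util
import os
import socket


def find_free_port():
    """Ask the OS for an unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def smoke_test_init_process_group():
    """Try a one-rank process group and return a short status string."""
    if importlib.util.find_spec("torch") is None:
        return "skipped: torch is not installed"

    import torch
    import torch.distributed as dist

    if not dist.is_available():
        return "unavailable: this build has no torch.distributed support"
    if not dist.is_gloo_available():
        return "unavailable: the gloo backend is not compiled in"

    # Rendezvous settings for the default env:// init method (overrides any existing values)
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(find_free_port())
    try:
        dist.init_process_group(backend="gloo", rank=0, world_size=1)
    except (RuntimeError, ValueError, OSError) as exc:
        return f"failed: {exc}"
    try:
        value = torch.ones(1)
        dist.all_reduce(value)  # sum across ranks; only one rank here
        return f"ok: all_reduce returned {value.item()}"
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(smoke_test_init_process_group())
```

A result starting with "ok" means the distributed runtime initializes. It says nothing about multi-device or multi-node training, which would still need to be tested separately.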
Thanks.
system closed this topic on October 9, 2024, 6:22am.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.