PyTorch environment setup

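I ran into the following two problems while setting up a PyTorch distributed training environment; could you please take a look? The module is an AGX Orin and the JetPack version is 5.1.1:

  1. Is there a complete environment setup guide and usage example for the PyTorch distributed training framework on AGX Orin?
  2. I am currently using the officially provided PyTorch package. After installing it, torch.distributed does not support functions such as init_process_group. After updating PyTorch with the pip3 upgrade torch command, init_process_group and related functions are supported, but the updated PyTorch no longer matches CUDA. Is there a PyTorch package that both supports init_process_group and matches CUDA, or do I need to install additional packages to get init_process_group support?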

Hi

1.
You can find the information in the two documents below:

Alternatively, we also provide containers with PyTorch pre-installed:

2.
By default, our prebuilt PyTorch package is not built with torch.distributed enabled.
If you need this feature, please build PyTorch from source with it enabled.
More details can be found in the link below:
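
To confirm which situation a given install is in, a small check like the one below may help. It is only a sketch: the helper name check_torch_distributed is made up for illustration, and it relies only on torch.version.cuda, torch.cuda.is_available() and torch.distributed.is_available(). A build without distributed support should report distributed_available as False, while a pip wheel that does not match the JetPack CUDA would typically report cuda_available as False.

```python
import importlib.util


def check_torch_distributed():
    """Report whether the installed PyTorch has CUDA and torch.distributed support.

    Hypothetical helper for illustration; returns a plain dict so it can be
    printed or compared before and after reinstalling PyTorch.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build": None,            # CUDA version PyTorch was compiled against
        "cuda_available": False,       # True only if the GPU is usable at runtime
        "distributed_available": False,
    }

    # Avoid an ImportError on machines where PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch
    import torch.distributed as dist

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = bool(torch.cuda.is_available())
    # False when PyTorch was compiled without distributed support; in that
    # case functions such as init_process_group are not exposed.
    report["distributed_available"] = bool(dist.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_torch_distributed().items():
        print(f"{key}: {value}")
```

A usable setup for distributed training on the GPU would show both cuda_available and distributed_available as True.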

Thanks.
