Hi, I’m using a Jetson AGX Orin with torch==2.0.0+nv23.05 installed and CUDA 11.4. I’m on this version because it matches my device and can use its GPU. I’m now trying to use RPC in torch to communicate between devices, but torch.distributed.is_available() returns False. I’d like to keep using the current torch version; is there any way to fix this? Any suggestions would help!
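For reference, this is the check that comes back False (assuming python3 is the interpreter the wheel is installed into):

$ python3 -c "import torch; print(torch.__version__, torch.distributed.is_available())"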
Hi,
Here are some suggestions for common issues:
1. Performance
Please run the commands below before benchmarking a deep learning use case:
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
2. Installation
Installation guides for deep learning frameworks on Jetson:
- TensorFlow: https://docs.nvidia.com/deeplearning/frameworks/install-tf-jetson-platform/index.html
- PyTorch: Installing PyTorch for Jetson Platform - NVIDIA Docs
We also have containers with these frameworks preinstalled (see the example after this list):
Data Science, Machine Learning, AI, HPC Containers | NVIDIA NGC
3. Tutorial
Starter deep learning tutorials:
- Jetson-inference: Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson
- TensorRT sample: Jetson/L4T/TRT Customized Example - eLinux.org
4. Report issue
If these suggestions don’t help and you want to report an issue to us, please share the model, the commands/steps used, and any customized app so we can reproduce it locally.
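As a minimal sketch of the container route mentioned in item 2 above (the image tag is illustrative; pick the one that matches your JetPack/L4T release):

# Start the l4t-pytorch container from NGC with GPU access and host networking
$ sudo docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3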
Thanks!
import os
import sys

import torch
import torch.distributed.rpc as rpc

# Rank 0 ("a") hosts the rendezvous; MASTER_ADDR must be its IP address.
os.environ['MASTER_ADDR'] = '192.168.1.101'
os.environ['MASTER_PORT'] = '29500'


def double_result_on_device_b(x):
    # Runs remotely on the worker named "b".
    return x * 2


if __name__ == "__main__":
    device = sys.argv[1]  # "a" on the AGX Orin, "b" on the Xavier NX
    rank = 0 if device == "a" else 1
    rpc.init_rpc(
        device,  # the worker name this process registers under
        rank=rank,
        world_size=2,
        rpc_backend_options=rpc.TensorPipeRpcBackendOptions()
    )
    if device == "a":
        a = 3
        b = 4
        result = a + b
        # The target name must match the name the rank-1 process registered with ("b").
        fut = rpc.rpc_async("b", double_result_on_device_b, args=(result,))
        print(f"answer:{fut.wait()}")
    # Both workers block here until all outstanding RPC work is done.
    rpc.shutdown()
Here’s my RPC code. I used a Jetson AGX Orin (64 GB RAM) and a Jetson Xavier NX (16 GB RAM) for the experiment. I started the script on each device, as shown below, and got the following error:
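The intended launch (assuming python3 is the interpreter and 192.168.1.101 is the AGX Orin, which takes rank 0):

# On the Jetson AGX Orin (rank 0, registers as worker "a"):
$ python3 rpc.py a
# On the Jetson Xavier NX (rank 1, registers as worker "b"):
$ python3 rpc.py b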
Traceback (most recent call last):
  File "rpc.py", line 18, in <module>
    rpc.init_rpc(
AttributeError: module 'torch.distributed.rpc' has no attribute 'init_rpc'
Hi,
Could you check whether the module exists in the package listed at the link below:
http://jetson.webredirect.org/jp6/cu126
Thanks.
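If it helps, an illustrative way to test this (assuming that link can be used as a pip index; the exact install invocation may differ) would be:

# Install torch from that index, then confirm the RPC entry point exists
$ pip3 install torch --index-url http://jetson.webredirect.org/jp6/cu126
$ python3 -c "import torch.distributed.rpc as rpc; print(hasattr(rpc, 'init_rpc'))"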
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.