Unable to perform inference with PyTorch through docker container nvcr.io/nvidia/pytorch:22.10-py3 on Jetson Orin

Good evening.
I’m using the Docker container nvcr.io/nvidia/pytorch:22.10-py3 (PyTorch | NVIDIA NGC) on my freshly flashed Jetson Orin. To run the container I use the following command:

docker run -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia nvcr.io/nvidia/pytorch:22.10-py3 bash

The container initializes correctly and recognizes the NVIDIA Tegra driver. Inside it I open a python3 console, import torch and torchvision, load a pretrained model, and try a sample forward pass with the following simple script:

import torch
from torchvision.models import resnet50, ResNet50_Weights

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).to(device)
my_input = torch.rand(1, 3, 224, 224).to(device)
model(my_input)

I can confirm that CUDA is available (torch.cuda.is_available() returns True), but the last line of the script produces the following error:

>>> model(my_input)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/resnet.py", line 268, in _forward_impl
    x = self.conv1(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

I know that CUDNN_STATUS_INTERNAL_ERROR is often caused by running out of memory, but if I run torch.cuda.get_device_properties(0).total_memory it reports the full 32 GB of my Orin’s DRAM.
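For reference, these are the two checks I ran inside the container:

import torch
torch.cuda.is_available()                          # returns True
torch.cuda.get_device_properties(0).total_memory   # reports ~32 GB of shared DRAM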
Is there anything I’m doing wrong? Do I have to add any other flag to the docker run command so the container can access all the resources? Could the error have some other cause?
Thank you very much

Regards,
Daniel

Hi @dgandiaga92, that container isn’t built for Jetson’s integrated GPU; please use the l4t-pytorch container for Jetson instead.
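For example, something along these lines should work (adjust the tag to match your JetPack / L4T version):

docker run -it --runtime nvidia nvcr.io/nvidia/l4t-pytorch:r35.1.0-pth1.13-py3 bash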

Hi dusty_nv, thank you for your answer.
The thing is that I’m missing some libraries in l4t-pytorch:r35.1.0-pth1.13-py3 that were present in pytorch:22.10-py3, specifically Torch-TensorRT (Torch-TensorRT — Torch-TensorRT master documentation), which I’m having trouble compiling from source myself inside an l4t-pytorch container. Is there any other alternative?

The Dockerfiles and build scripts for the l4t-pytorch containers can be found here if you want to modify them and rebuild: https://github.com/dusty-nv/jetson-containers

You might find torch2trt similar and easier to install: https://github.com/NVIDIA-AI-IOT/torch2trt
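The basic conversion flow with torch2trt is roughly the following (a minimal sketch assuming the torch2trt package is installed from that repo; I haven’t run it against your exact model):

import torch
from torch2trt import torch2trt
from torchvision.models import resnet50, ResNet50_Weights

# load the pretrained model in eval mode on the GPU
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval().cuda()

# an example input is used to trace the network and build the TensorRT engine
x = torch.rand(1, 3, 224, 224).cuda()

# the converted module can then be called just like the original one
model_trt = torch2trt(model, [x])
y_trt = model_trt(x)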

Thanks, but what I’d need is the Dockerfile and build script of pytorch:22.10-py3 so I can see how it installs torch_tensorrt and replicate that on an L4T-based container. Regarding torch2trt, I’ve already tried it and the output of the compiled model was complete noise, unlike with torch_tensorrt, where the output was exactly the same as with the original model.
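For reference, the Torch-TensorRT workflow I was using in pytorch:22.10-py3 looks roughly like this (a sketch; the exact compile settings may differ from what I ran):

import torch
import torch_tensorrt
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval().cuda()

# compile the model into a TensorRT-backed TorchScript module
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.float},
)
y = trt_model(torch.rand(1, 3, 224, 224).cuda())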

There are other NVIDIA Dockerfiles here (https://gitlab.com/nvidia/container-images), but unfortunately those don’t seem to include the non-L4T PyTorch container. I’m not familiar with installing Torch-TensorRT myself, so what I’d recommend is posting the errors you are getting while building it to the Torch-TensorRT GitHub repo here: https://github.com/pytorch/TensorRT/issues

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.