Issue running TensorRT Demos on Clara AGX within Docker PyTorch Container

I am using a Clara AGX developer kit, and I am trying to run a TensorRT demo - specifically, the diffusion demo at this link: TensorRT/demo/Diffusion at main · NVIDIA/TensorRT · GitHub.

I am launching the NGC container with Docker as instructed in the GitHub repository.
However, when I try to build the plugin libraries by running make -j$(nproc), I get the error:
fatal error: cuda_runtime_api.h: No such file or directory

CUDA seems to work, as the nvidia-smi output is as expected.

I’m not sure how to resolve this issue. For additional context, I set up the Clara AGX using the SDK Manager.

Hello there! Sorry for the late reply.

Are you able to find cuda_runtime_api.h on bare metal?

I just tried following the steps in TensorRT/demo/Diffusion at main · NVIDIA/TensorRT · GitHub, skipping the part "(Optional) Install latest TensorRT release", and make -j$(nproc) finished without the error.
I find the file on bare metal at: /usr/local/cuda-11.6/targets/sbsa-linux/include/cuda_runtime_api.h on the Clara AGX devkit, and at /usr/local/cuda-11.8/targets/sbsa-linux/include/cuda_runtime_api.h within the launched container.
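
A quick way to repeat this check on both the host and inside the container is a small search script. This is only a minimal sketch: it assumes the toolkit lives under /usr/local/cuda*, as in the paths above, and the function name find_cuda_header is made up for illustration.

```python
import glob
import os


def find_cuda_header(header="cuda_runtime_api.h", roots=("/usr/local",)):
    """Return sorted paths to `header` under any cuda* directory in `roots`.

    Assumption: the CUDA toolkit is installed under <root>/cuda*,
    e.g. /usr/local/cuda-11.6 as reported in this thread.
    """
    matches = []
    for root in roots:
        # "**" with recursive=True matches any number of nested directories,
        # e.g. cuda-11.6/targets/sbsa-linux/include/<header>
        pattern = os.path.join(root, "cuda*", "**", header)
        matches.extend(glob.glob(pattern, recursive=True))
    # A /usr/local/cuda symlink can surface the same file under two paths;
    # set() only removes exact duplicate strings.
    return sorted(set(matches))


if __name__ == "__main__":
    hits = find_cuda_header()
    for path in hits:
        print(path)
    if not hits:
        print("cuda_runtime_api.h not found under /usr/local/cuda*")
```

If the script finds nothing inside the container, the header is missing from the image or installed somewhere else, which would explain the fatal error from make.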

No worries. I'm not sure what the issue was earlier, but I followed the steps again and was able to build successfully. I can also locate the necessary CUDA files on bare metal.

However, when I try to launch the model (>python3 --help) after building, I receive the error: “ModuleNotFoundError: No module named ‘cuda’”, so it seems a CUDA issue persists.
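
Note that this error refers to a Python module named cuda, not to the CUDA toolkit itself. As far as I know, that module is provided by the cuda-python package on PyPI, but the thread does not confirm that installing it resolves the problem, so treat that as an assumption. A minimal sketch for checking which modules the interpreter can see (the helper name and the extra module names are illustrative):

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "cuda" is the module named in the error above; "tensorrt" and "torch"
    # are assumptions about what the demo imports.
    for name in missing_modules(["cuda", "tensorrt", "torch"]):
        print(f"missing: {name}")
```

Running this with the same python3 that launches the demo shows whether the module is missing from that particular environment.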

Good to hear you got past the initial issue! I would suggest raising the issue on one of the CUDA forums: CUDA - NVIDIA Developer Forums

Thanks, I’ll raise an issue there.
Prior to launching the model, I’m getting the following error when trying to install requirements.txt in the container:

ERROR: Could not find a version that satisfies the requirement torch==1.12.1+cu116 (from versions: 1.8.0, 1.8.1, 1.9.0, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1)
ERROR: No matching distribution found for torch==1.12.1+cu116

I’ve tried installing this requirement directly as instructed by the PyTorch website in the container, like so:

pip3 install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url

However, I get the same ‘No matching distribution found’ error. I’ll raise an issue on the CUDA forum about this, but I thought I’d mention it here. Thanks!
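
One detail in the error above is worth spelling out: pip lists 1.12.1 as available but still rejects 1.12.1+cu116. Under PEP 440, a pin that carries a local version label (the part after "+") only matches a candidate with the same label. The sketch below is a simplified illustration of that rule, not pip's actual implementation; it ignores normalization such as zero padding, and the function name is made up.

```python
def pin_matches(pin, candidate):
    """Simplified `==` matching for versions with an optional +local label."""
    pin_public, _, pin_local = pin.partition("+")
    cand_public, _, cand_local = candidate.partition("+")
    if pin_public != cand_public:
        return False
    # A pin without a local label ignores the candidate's label;
    # a pin with a label requires the candidate to carry the same one.
    return pin_local == "" or pin_local == cand_local


# Versions copied from the pip error message in this thread.
AVAILABLE = ["1.8.0", "1.8.1", "1.9.0", "1.10.0", "1.10.1", "1.10.2",
             "1.11.0", "1.12.0", "1.12.1", "1.13.0", "1.13.1"]

if __name__ == "__main__":
    pin = "1.12.1+cu116"
    print([v for v in AVAILABLE if pin_matches(pin, v)])
```

None of the versions offered for this platform carry the +cu116 label, so the pin cannot be satisfied, no matter how the install command is phrased.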

Oh I see, the torch installation on arm+dGPU can be a little tricky. I will ask around and see if someone has the install recipe. In the meantime, one thing you could try is using the PyTorch base image instead of the TRT base image. The PyT base image supports both x86 and arm64; you can see the details here: PyTorch | NVIDIA NGC

@rchand18 Following up on the previous message: the existing pip wheels will not work on the Clara AGX devkit, since their arm builds target Mac M1/M2. To use PyTorch on the devkit, you could build PyTorch from source or use the NGC PyT container.

I tried following the same steps in the PyTorch base image, but I still get the same

ERROR: No matching distribution found for torch==1.12.1+cu116

issue. I’ll try building from source and see if it yields different results.

That error is likely because the 22.10 image has a higher CUDA/PyTorch version than 11.6/1.12.1. Please see PyTorch Release 22.10 for the software versions in each Docker image release. Perhaps you could try an earlier version of the PyTorch Docker image, although that may require you to use an earlier version of the TensorRT Diffusion demo repo. Building from source can also be a good option.
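
Before installing requirements.txt in a given image, it can help to compare each pin against what the image already ships. This is a rough sketch that assumes simple name==version pins; check_pin is a made-up helper, and the lookup parameter exists only so the function can be exercised without the package installed.

```python
from importlib import metadata


def check_pin(requirement, lookup=metadata.version):
    """Compare a `name==version` pin with the installed version, if any."""
    name, _, wanted = requirement.partition("==")
    name, wanted = name.strip(), wanted.strip()
    try:
        installed = lookup(name)
    except metadata.PackageNotFoundError:
        installed = None  # package is not installed in this environment
    return {
        "name": name,
        "wanted": wanted,
        "installed": installed,
        "matches": installed == wanted,  # exact string comparison only
    }


if __name__ == "__main__":
    # Pin taken from the error message in this thread.
    print(check_pin("torch==1.12.1+cu116"))
```

If the installed version differs from the pin, the choice is between relaxing the pin to use the preinstalled build and picking an image release whose versions match the demo.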