Jetpack 5.1.2 Model Training Fails with Torch 2.1.0

Hello,

We are trying to run a simple MNIST training experiment on a Xavier dev board with JetPack 5.1.2, CUDA 11.4, and Torch 2.1.0.
We use the torch .whl file from PyTorch for Jetson (the JetPack 5 build of PyTorch 2.1.0).
Since no torchvision .whl is provided, we build it from source following jetson-containers/packages/pytorch/torchvision at master · dusty-nv/jetson-containers · GitHub.
We install torch and torchvision into our environment from the .whl files and run the experiment.
At the start of training we get the error below:

File "/mnt/nvme/.venvs/venv3_8/lib/python3.8/site-packages/torch/autograd/__init__.py", line 204, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Event device type CUDA does not match blocking stream's device type CPU.

When we follow the same procedure on a Xavier with JetPack 5.1.2 and CUDA 12.2, using the same training scripts, training does not fail.
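
To make the difference between the two boards easier to compare, a small report like the one below could be run on each of them. This is only a sketch: the function name `cuda_environment_report` is our own, and it only reads attributes that PyTorch exposes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`).

```python
def cuda_environment_report():
    """Collect the CUDA-related facts that may differ between the two boards."""
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing or one of its shared libraries failed to load
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds)
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        report["cudnn"] = torch.backends.cudnn.version()
    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` does not match the CUDA toolkit installed on the board, that would be worth mentioning alongside the traceback.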

Is this a known issue?

Hi,

Which TorchVision version are you using?
I would expect it to be v0.16.1.

Thanks.

Hello,

Thank you for the answer.
The jetson-containers GitHub repository lists torchvision 0.16.2 for PyTorch 2.1.
That is why we built torchvision 0.16.2.
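
For reference, the installed versions in the venv can be confirmed with a short check like this one. It is a minimal sketch: the helper names and the small `KNOWN_PAIRS` table are our own, and the table only encodes the upstream release pairing (torch 2.0 with torchvision 0.15, torch 2.1 with torchvision 0.16), not anything Jetson-specific.

```python
from importlib import metadata

# Upstream release pairings (torch major.minor -> torchvision major.minor).
# Illustrative only; extend the table for other releases as needed.
KNOWN_PAIRS = {"2.0": "0.15", "2.1": "0.16"}


def major_minor(version):
    """Reduce a version string such as '2.1.0a0+local' to '2.1'."""
    public = version.split("+")[0]          # drop any local build suffix
    return ".".join(public.split(".")[:2])  # keep major and minor only


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pairing(torch_version, torchvision_version):
    """Return 'ok', 'mismatch', or 'unknown' for a torch/torchvision pair."""
    expected = KNOWN_PAIRS.get(major_minor(torch_version))
    if expected is None:
        return "unknown"
    return "ok" if major_minor(torchvision_version) == expected else "mismatch"


if __name__ == "__main__":
    tv, tvis = installed_version("torch"), installed_version("torchvision")
    print("torch:", tv, "torchvision:", tvis)
    if tv and tvis:
        print("pairing:", check_pairing(tv, tvis))
```

By this check, torch 2.1.0 with torchvision 0.16.2 reports `ok`, since only the major.minor parts are compared.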

Hi,

Could you try the container below?
It has PyTorch, TorchVision, and TorchAudio preinstalled,
so the versions are expected to be compatible.

Thanks.

Hello,

I do not see any container for PyTorch 2.1.0.
The newest version available is PyTorch 2.0.
Is there any problem with building or running PyTorch 2.1.0 on JetPack 5.1.2?

Hello,

Looking at the Torch build page in jetson-containers, we see that PyTorch 2.1.0 is supported for L4T 35.

Likewise, the Torchvision build page in jetson-containers shows that Torchvision 0.16.2 is supported for L4T 35.

We built both .whl files, but we get errors during training.
Can we use PyTorch 2.1.0 and Torchvision 0.16.2 for model training on Xavier/Orin development boards with JetPack 5.1.2?

Hi,

According to this:

RuntimeError: Event device type CUDA does not match blocking stream's device type CPU.

It might not be related to compatibility between PyTorch and TorchVision.
Could you share how you run the training?

Please try adding the line below to your script to see if it helps.

import pycuda.autoinit
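
One way to test this in isolation is a short script along the following lines. It is only a sketch, not your training code: the function name `backward_smoke_test`, the `--pycuda` flag, and the tiny linear model are placeholders. It runs a single forward and backward pass, optionally importing `pycuda.autoinit` first (which creates a CUDA context on import), and reports whether `backward()` raised.

```python
import sys


def backward_smoke_test(use_pycuda=False):
    """Run one forward/backward pass and report what happened."""
    if use_pycuda:
        try:
            # Imported for its side effect: it creates a CUDA context.
            import pycuda.autoinit  # noqa: F401
        except Exception as exc:  # not installed, or no usable CUDA device
            return {"status": "pycuda-unavailable", "detail": str(exc)}

    try:
        import torch
    except (ImportError, OSError):
        return {"status": "torch-unavailable"}

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.Linear(784, 10).to(device)        # MNIST-sized input
    inputs = torch.randn(8, 784, device=device)        # random stand-in batch
    targets = torch.randint(0, 10, (8,), device=device)

    try:
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
    except RuntimeError as exc:
        return {"status": "failed", "device": device.type, "detail": str(exc)}
    return {"status": "ok", "device": device.type, "loss": loss.item()}


if __name__ == "__main__":
    # Run twice in separate processes: once plain, once with --pycuda.
    print(backward_smoke_test(use_pycuda="--pycuda" in sys.argv))
```

If the plain run reports `failed` with the same message and the `--pycuda` run reports `ok`, that would suggest the problem is in CUDA context setup rather than in the training script.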

Thanks.
