I am playing around with the interop_torch.py example. The default script runs fine, but if I change the graphics device to a different GPU (instead of the default 0), the script crashes with a CUDA error.
Also, the graphics_device parameter in create_sim does not seem to respect the CUDA_VISIBLE_DEVICES environment variable. If I set CUDA_VISIBLE_DEVICES=1, the camera tensors are still reported on device cuda:1, whereas PyTorch would normally map that GPU to cuda:0.
We’ll have to look into this more closely. There are definitely a few places where we aren’t yet handling multi-GPU cases as well as we should be, and this may be one of them. I can’t reproduce this issue on a machine with a single GPU.
What are the two GPUs you have, btw? Are they both the same, or do you have two different architectures?
One further note about CUDA_VISIBLE_DEVICES: it controls which compute devices are available, but not graphics devices. The create_sim graphics device parameter refers to enumerated Vulkan devices, which are not hidden by CUDA_VISIBLE_DEVICES.
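The distinction matters because CUDA_VISIBLE_DEVICES also remaps ordinals: PyTorch's cuda:N is the N-th *visible* device, not the N-th physical one. Here is a small pure-Python sketch of that remapping (a simplified model for illustration, not an Isaac Gym or CUDA API; real CUDA stops parsing at the first invalid entry):

```python
import os

def visible_cuda_ordinals(num_physical_gpus, env=None):
    """Map CUDA's logical ordinals to physical GPU indices.

    Logical ordinal i corresponds to the i-th entry of
    CUDA_VISIBLE_DEVICES. Vulkan (graphics) enumeration
    ignores this variable entirely.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # No filtering: logical and physical indices coincide.
        return {i: i for i in range(num_physical_gpus)}
    mapping = {}
    for logical, token in enumerate(value.split(",")):
        physical = int(token)
        if 0 <= physical < num_physical_gpus:
            mapping[logical] = physical
    return mapping

# With two physical GPUs and CUDA_VISIBLE_DEVICES=1, PyTorch's
# cuda:0 is physical GPU 1 -- and there is no cuda:1 at all.
print(visible_cuda_ordinals(2, env={"CUDA_VISIBLE_DEVICES": "1"}))  # → {0: 1}
```

This is why a tensor that Isaac Gym reports on cuda:1 can be an "invalid device ordinal" from PyTorch's point of view when only one device is visible.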
If your compute device is on GPU 0 and you’re rendering on GPU 1, that could explain the runtime error: the camera data isn’t on the GPU that PyTorch expects.
If you set CUDA_VISIBLE_DEVICES=1 and use GPU 1 for the graphics, do you still see the crash?
And it still gives the same error. It seems I can only set graphics_device to 0 or -1; other values give a CUDA error.
And if I run the script with CUDA_VISIBLE_DEVICES=1 and GPU 1 for graphics, I get RuntimeError: CUDA error: invalid device ordinal on the line print(cam_tensors[0].cpu()). My guess is that PyTorch expects all tensors to be on cuda:0 in this case, since it does not see the other GPUs, but Isaac Gym can still see them and returns the camera images on device cuda:1, which PyTorch does not recognize.
Something even stranger than that is happening. It works properly with the graphics device set to 0 and the compute device set to 1, and printing the camera tensor shows it’s on the cuda:0 device.
If you set the graphics device to 1, it works if you do a .clone().detach() on the camera tensor. You can happily move it to any device you want from there as well.
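The workaround could look like this small sketch (the helper name is illustrative, not an Isaac Gym API; cam_tensor stands for one of the tensors Isaac Gym wraps around a camera image):

```python
import torch

def snapshot_camera(cam_tensor, target_device="cpu"):
    # .clone().detach() copies the image out of the Gym-owned buffer
    # into a regular PyTorch tensor, which can then be moved to any
    # device without tripping over the graphics-device mismatch.
    return cam_tensor.clone().detach().to(target_device)
```

In the interop_torch.py loop this would replace a direct print(cam_tensors[0].cpu()) with print(snapshot_camera(cam_tensors[0])).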
For now I’d suggest just keeping rendering on device 0. This one will likely need more time for us to track down.
Note that there are some places in the RL examples where cuda:0 is explicitly used. You may want to look at that more closely if you want to force training on another GPU.
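One way to avoid chasing those hard-coded strings is to route the device choice through a single helper; this is a hypothetical pattern, not part of the examples, and the RL_DEVICE variable name is our own invention:

```python
import os
import torch

def training_device(default="cuda:0"):
    # RL_DEVICE is a made-up environment variable for this sketch;
    # it lets you force training onto another GPU without editing code.
    requested = os.environ.get("RL_DEVICE", default)
    if requested.startswith("cuda") and not torch.cuda.is_available():
        return torch.device("cpu")  # fall back when no CUDA device is visible
    return torch.device(requested)
```

Every place that currently says torch.device("cuda:0") would then call training_device() instead, so switching GPUs becomes a one-line change.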