Error in printing camera tensors (`CUDA error: an illegal memory access was encountered`)

Hi,

I am playing around with the interop_torch.py example. The default script runs fine. But if I change the graphics device to a different GPU (instead of the default 0):

sim = gym.create_sim(0, 1, args.physics_engine, sim_params)

and then run print(cam_tensors[0]) in the while loop, I get a CUDA error (RuntimeError: CUDA error: an illegal memory access was encountered).

But I can run print(cam_tensors[0].cpu()) without issues.

I also attached my script here.

Also, the graphics_device argument of create_sim does not seem to respect the environment variable CUDA_VISIBLE_DEVICES. If I set CUDA_VISIBLE_DEVICES=1, the camera tensors are still on device cuda:1, whereas PyTorch would normally expose that GPU as cuda:0.

Hi @taocc,

We’ll have to look into this more closely. There are definitely a few places where we aren’t yet handling multi-GPU cases as well as we should be, and this may be one of them. I can’t reproduce this issue on a machine with a single GPU.

What are the two GPUs you have, btw? Are they both the same, or do you have two different architectures?

Take care,
-Gav

One further note about CUDA_VISIBLE_DEVICES: it controls which compute devices are available, but not graphics devices. The create_sim graphics device parameter uses enumerated Vulkan devices, which are not hidden by CUDA_VISIBLE_DEVICES.

If your compute device is on GPU 0 and you’re rendering on GPU 1, that could be a reason for the runtime error: the camera data isn’t on the GPU that PyTorch is expecting.

If you set CUDA_VISIBLE_DEVICES=1 and use GPU 1 for the graphics, do you still see the crash?

Take care,
-Gav

Hi Gav,

Thanks for looking into these issues. My machine has two identical 2080 Ti GPUs. I also tried using the second GPU for both physics and graphics:

sim = gym.create_sim(1, 1, args.physics_engine, sim_params)

It still gives the same error. It seems I can only set graphics_device to 0 or -1. Any other value gives the CUDA error.

If I run the script with CUDA_VISIBLE_DEVICES=1 and use GPU 1 for graphics, I get RuntimeError: CUDA error: invalid device ordinal at the line print(cam_tensors[0].cpu()). My guess is that PyTorch expects all tensors to be on cuda:0 in this case, since it does not see any other GPU. But Isaac Gym can still see the other GPUs and returns the camera images on device cuda:1, which PyTorch does not recognize.
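
To make the ordinal mismatch described above concrete, here is a minimal sketch of how a CUDA process renumbers the GPUs listed in CUDA_VISIBLE_DEVICES. The helper visible_ordinal is hypothetical (it is not part of PyTorch or Isaac Gym), and it assumes the variable holds plain integer indices rather than GPU UUIDs. It only models the CUDA/PyTorch side; per Gav's note, the graphics device index passed to create_sim refers to Vulkan devices and is not renumbered this way.

```python
from typing import Optional


def visible_ordinal(physical_index: int, cuda_visible_devices: Optional[str]) -> Optional[int]:
    """Return the cuda:N ordinal a CUDA process would assign to a GPU.

    Hypothetical helper for illustration only. physical_index is the index
    the GPU has when CUDA_VISIBLE_DEVICES is unset. Assumes the variable
    contains plain integer indices (GPU UUID entries are not handled).
    """
    if cuda_visible_devices is None:
        # Variable unset: every GPU is visible under its own index.
        return physical_index

    visible = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit():
            # CUDA ignores everything from the first invalid entry onward.
            break
        visible.append(int(token))

    # Visible devices are renumbered from 0 in the order they are listed.
    return visible.index(physical_index) if physical_index in visible else None


# With CUDA_VISIBLE_DEVICES=1, GPU 1 is the only CUDA device and becomes cuda:0,
# so a tensor reported as living on cuda:1 has no valid ordinal in that process.
print(visible_ordinal(1, "1"))  # 0
print(visible_ordinal(0, "1"))  # None
```

Under this model, "invalid device ordinal" is what one would expect when something hands PyTorch a tensor labelled cuda:1 in a process that can only see a single CUDA device.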

Hi @taocc,

Something even stranger than that is happening. It works properly with the graphics device set to 0 and the compute device set to 1. Printing the camera tensor shows it’s on the cuda:0 device.

If you set the graphics device to 1, it works if you do a .clone().detach() on the camera tensor. You can happily move it to any device you want from there as well.
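
The workaround above could be wrapped in a small helper like the one below. This is only a sketch: snapshot_camera_tensor is a made-up name, and the function relies on nothing beyond the clone(), detach() and to() methods that PyTorch tensors provide, which is why it needs no imports. It assumes cam_tensors[0] is the wrapped camera tensor from interop_torch.py.

```python
def snapshot_camera_tensor(cam_tensor, device=None):
    """Copy a camera tensor into memory owned by PyTorch.

    Hypothetical helper: cam_tensor is expected to be a torch.Tensor that
    wraps the simulator's camera buffer. clone() allocates a new tensor and
    detach() drops any autograd history, so the result no longer aliases
    the original buffer.
    """
    copy = cam_tensor.clone().detach()
    if device is not None:
        # Optionally move the copy, e.g. to "cpu" or "cuda:0".
        copy = copy.to(device)
    return copy


# Intended use inside the render loop of interop_torch.py (assumed names):
#   img = snapshot_camera_tensor(cam_tensors[0])
#   print(img)
```

Working on the copy rather than on the wrapped tensor mirrors the suggestion above; whether it avoids the error on every setup is not something this sketch can guarantee.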

For now I’d suggest just keeping rendering on device 0. This one will likely need more time for us to track down.

Note that there are some places in the RL examples where cuda:0 is explicitly used. You may want to look at that more closely if you want to force training on another GPU.

Take care,
-Gav

Got it. Thanks!