Invalid device ordinal error when running example code interop_torch.py with CUDA_VISIBLE_DEVICES set

To reproduce the error:

export CUDA_VISIBLE_DEVICES=2 # This is the same GPU device as my graphics device
python interop_torch.py --sim_device='cuda:0' --graphics_device_id=3 --headless

Stdout and stderr from running the command:

Importing module 'gym_38' (/home/cirrascale/Projects/isaacgym/python/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/cirrascale/Projects/isaacgym/python/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
PyTorch version 2.0.1
Device count 1
/home/cirrascale/Projects/isaacgym/python/isaacgym/_bindings/src/gymtorch
Using /home/cirrascale/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /home/cirrascale/.cache/torch_extensions/py38_cu117/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:0
GPU Pipeline: enabled
[Error] [carb.gym.plugin] Gym cuda error: invalid device ordinal: ../../../source/plugins/carb/gym/impl/Gym/GymPhysXCuda.cu: 926
[Error] [carb.gym.plugin] Failed to fill rigid body state tensor
Loading extension module gymtorch...
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Got camera tensor with shape (128, 128, 4)
  Torch camera tensor device: cuda:3
  Torch camera tensor shape: torch.Size([128, 128, 4])
Gym state tensor shape: (16, 13)
Gym state tensor data @ 0x7f2b41a00000
Torch state tensor device: cuda:0
Torch state tensor shape: torch.Size([16, 13])
Torch state tensor data @ 0x7f2b41a00000
========= Frame 0 ==========
RB positions:
tensor([[0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000],
        [0.0000, 4.9959, 0.0000]], device='cuda:0')
Traceback (most recent call last):
  File "interop_torch.py", line 196, in <module>
    cam_img = cam_tensors[i].cpu().numpy()
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
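
A possible reading of the log above, offered as a sketch rather than a confirmed diagnosis: with CUDA_VISIBLE_DEVICES=2 the process sees a single GPU ("Device count 1"), and CUDA renumbers visible devices from 0, so cuda:0 is the only valid ordinal inside the process while the camera tensors are reported on cuda:3. The helper names below (visible_ordinals, is_valid_ordinal) are hypothetical and not part of Isaac Gym or PyTorch, and the code assumes CUDA_VISIBLE_DEVICES holds plain integer indices, not GPU UUIDs.

```python
def visible_ordinals(cuda_visible_devices):
    """Map in-process CUDA ordinals to physical GPU indices.

    Assumes a comma-separated list of integer indices, e.g. "2" or "0,3".
    (Hypothetical helper for illustration only.)
    """
    physical = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    # CUDA renumbers the visible devices 0, 1, 2, ... in the order listed.
    return dict(enumerate(physical))


def is_valid_ordinal(ordinal, cuda_visible_devices):
    """Return True if `cuda:<ordinal>` exists inside the process."""
    return ordinal in visible_ordinals(cuda_visible_devices)


if __name__ == "__main__":
    env = "2"  # value exported in the reproduction steps
    print(visible_ordinals(env))
    print("cuda:0 valid:", is_valid_ordinal(0, env))
    print("cuda:3 valid:", is_valid_ordinal(3, env))
```

Under these assumptions, cuda:0 maps to physical GPU 2 and cuda:3 does not exist in the process, which would match the "invalid device ordinal" raised when the camera tensor is copied to the CPU. Whether --graphics_device_id is expected to follow the remapped numbering is not something the log settles.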

Hi, how did you resolve this invalid device ordinal error?
I get a similar error when running the torchrun command to host Llama 2 locally.