Maximizing GPU resource usage with the OmniIsaacGymEnvs repo

Hi,

I'm experimenting with the OmniIsaacGymEnvs repo and trying to increase the number of instances in a single environment. My GPU is an RTX A5000 with 24 GB of memory.

For built-in tasks like FrankaCabinet, I set minibatch_size to twice the number of instances, and the largest number of instances I can reach is 8196. The GPU usage at that setting looks like this:

When I increased the number to 1.5 × 8196 = 12294, I got the following error:

2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: GPU integrateCoreParallel fail to launch kernel!!
, FILE /buildAgent/work/16dcef52b68a730f/source/gpusolver/src/PxgTGSCudaSolverCore.cpp, LINE 2393
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuEventRecord failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 75
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuStreamWaitEvent failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 81
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: GPU kernel 'markAggregateBoundsUpdated' failed to launch!!
, FILE /buildAgent/work/16dcef52b68a730f/source/gpubroadphase/src/PxgAABBManager.cpp, LINE 1206
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuEventRecord failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 75
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuStreamWaitEvent failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 81
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,310ms] [Error] [omni.physx.plugin] PhysX error: PhysX Internal CUDA error. Simulation can not continue!, FILE /buildAgent/work/16dcef52b68a730f/source/physx/src/NpSceneFetchResults.cpp, LINE 216
2023-09-18 19:04:35 [109,310ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:36 [109,611ms] [Error] [omni.physx.tensors.plugin] CUDA error: an illegal memory access was encountered: ../../../source/extensions/omni.physx.tensors/plugins/gpu/GpuArticulationView.cpp: 71
Error executing job with overrides: ['task=FrankaCabinet']
Traceback (most recent call last):
  File "scripts/rlgames_train.py", line 114, in parse_hydra_configs
    task = initialize_task(cfg_dict, env)
  File "/workspace/omniisaacgymenvs/omniisaacgymenvs/utils/task_util.py", line 77, in initialize_task
    env.set_task(task=task, sim_params=sim_config.get_physics_params(), backend="torch", init_sim=init_sim)
  File "/workspace/omniisaacgymenvs/omniisaacgymenvs/envs/vec_env_rlgames.py", line 51, in set_task
    super().set_task(task, backend, sim_params, init_sim)
  File "/isaac-sim/exts/omni.isaac.gym/omni/isaac/gym/vec_env/vec_env_base.py", line 94, in set_task
    self._world.reset()
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/world/world.py", line 282, in reset
    self._scene._finalize(self.physics_sim_view)
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/scenes/scene.py", line 290, in _finalize
    articulated_view.initialize(physics_sim_view)
  File "/workspace/omniisaacgymenvs/omniisaacgymenvs/robots/articulations/views/franka_view.py", line 28, in initialize
    super().initialize(physics_sim_view)
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/articulations/articulation_view.py", line 218, in initialize
    self._default_kps, self._default_kds = self.get_gains(clone=True)
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/articulations/articulation_view.py", line 1673, in get_gains
    kds[self._backend_utils.expand_dims(indices, 1), joint_indices], device=self._device
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/utils/torch/tensor.py", line 58, in move_data
    return data.to(device=device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

After switching to my own robot arm model, the maximum number of instances is 2048, with the following GPU consumption:

If I raise the number to 3072, the error below appears, so there is clearly room for improvement. On the model side, I have already tried simplifying the meshes by reducing the number of triangles and vertices, but this brought only a marginal gain.

2023-09-18 19:12:59 [20,725ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuEventRecord failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 53
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuStreamWaitEvent failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 59
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] PhysX error: memcpy failed fail!
  700, FILE /buildAgent/work/16dcef52b68a730f/source/gpunarrowphase/src/PxgNarrowphaseCore.cpp, LINE 2077
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:12:59 [20,743ms] [Warning] [omni.physx.plugin] PhysX warning: Failed to allocate pinned memory., FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/src/PxgCudaMemoryAllocator.cpp, LINE 58
/isaac-sim/python.sh: line 41:  1300 Segmentation fault      (core dumped) $python_exe "$@" $args
There was an error running python

My goal is to simulate as many copies of my own robot as I can, and right now there seems to be a large gap between my model and the FrankaCabinet task.

Is there a parameter I need to change to get past this limit? I would really appreciate any suggestions!

Best,
Chay

Hi there, please try increasing the GPU buffer dimensions in the task config file, which can be found here for the FrankaCabinet task. Generally, the found lost pairs and aggregate pairs buffers are the ones most likely to need increasing.
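
As a rough illustration of that suggestion, here is a minimal Python sketch that scales a few buffer capacities in proportion to the instance count and prints them as Hydra-style overrides. Several things here are assumptions rather than anything confirmed in this thread: the parameter names (gpu_found_lost_pairs_capacity, gpu_found_lost_aggregate_pairs_capacity, gpu_total_aggregate_pairs_capacity), the task.sim.physx override prefix, and the baseline values, which are placeholders that you should replace with whatever your own task YAML contains. Rounding up to a power of two is just a convenience choice, not a known PhysX requirement.

```python
import math

# Placeholder baselines (assumed): copy the real values from the
# physx section of your task's YAML config before using this.
BASE_CAPACITIES = {
    "gpu_found_lost_pairs_capacity": 524288,
    "gpu_found_lost_aggregate_pairs_capacity": 262144,
    "gpu_total_aggregate_pairs_capacity": 1048576,
}


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def scaled_overrides(base, old_num_envs, new_num_envs, prefix="task.sim.physx"):
    """Scale each capacity by new/old instance count and format as overrides.

    The prefix is an assumed Hydra path; adjust it to match your config layout.
    """
    factor = new_num_envs / old_num_envs
    overrides = []
    for key, value in base.items():
        scaled = next_power_of_two(math.ceil(value * factor))
        overrides.append(f"{prefix}.{key}={scaled}")
    return overrides


if __name__ == "__main__":
    # Instance counts taken from the question: 8196 worked, 12294 crashed.
    for item in scaled_overrides(BASE_CAPACITIES, 8196, 12294):
        print(item)
```

If the override path matches your setup, the printed items can be appended to the training command after task=FrankaCabinet. Otherwise, edit the same keys directly in the task YAML. If the error persists, keep increasing the values step by step while watching GPU memory, since larger buffers consume more of the 24 GB.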