Hi,
I’m experimenting with the OmniIsaacGymEnvs repo and trying to increase the number of instances in a single environment. My GPU is an RTX A5000 with 24 GB of memory.
For built-in tasks like FrankaCabinet, I set minibatch_size to twice the number of instances, and the largest number of instances I can run is 8096. GPU usage at that point looks like this:
[screenshot: GPU usage with FrankaCabinet]
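The minibatch rule above can be made concrete with a small sketch of the PPO batch arithmetic. It assumes rl_games-style batching, where one rollout collects `horizon_length` steps from every environment and the rollout batch must split evenly into minibatches; `horizon_length=16` and the helper name `ppo_batch_layout` are illustrative assumptions, not values taken from the FrankaCabinet config.

```python
# Sketch of the PPO batch arithmetic behind "minibatch_size = 2 x num_envs".
# Assumption: rl_games-style batching, where one rollout collects
# horizon_length steps from every environment. horizon_length=16 is an
# illustrative value, not necessarily what the FrankaCabinet PPO config uses.

def ppo_batch_layout(num_envs: int, horizon_length: int = 16) -> dict:
    """Return the rollout batch size, minibatch size and minibatch count."""
    batch_size = num_envs * horizon_length  # transitions collected per rollout
    minibatch_size = 2 * num_envs           # the rule of thumb from the post
    # The rollout batch must divide evenly into minibatches.
    if batch_size % minibatch_size != 0:
        raise ValueError(
            f"batch_size {batch_size} is not divisible by minibatch_size {minibatch_size}"
        )
    return {
        "batch_size": batch_size,
        "minibatch_size": minibatch_size,
        "num_minibatches": batch_size // minibatch_size,
    }


for n in (2048, 3072, 8096):
    print(n, ppo_batch_layout(n))
```

With this rule the number of minibatches stays at `horizon_length / 2` regardless of the instance count, so the training-side batch grows linearly with `num_envs`. Note that the first traceback below fails inside `world.reset()` during task setup, before any PPO update runs.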
When I increased the number to 1.5 × 8196 = 12294, I got the following error:
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: GPU integrateCoreParallel fail to launch kernel!!
, FILE /buildAgent/work/16dcef52b68a730f/source/gpusolver/src/PxgTGSCudaSolverCore.cpp, LINE 2393
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuEventRecord failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 75
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuStreamWaitEvent failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 81
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: GPU kernel 'markAggregateBoundsUpdated' failed to launch!!
, FILE /buildAgent/work/16dcef52b68a730f/source/gpubroadphase/src/PxgAABBManager.cpp, LINE 1206
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuEventRecord failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 75
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuStreamWaitEvent failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 81
2023-09-18 19:04:35 [109,109ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:35 [109,310ms] [Error] [omni.physx.plugin] PhysX error: PhysX Internal CUDA error. Simulation can not continue!, FILE /buildAgent/work/16dcef52b68a730f/source/physx/src/NpSceneFetchResults.cpp, LINE 216
2023-09-18 19:04:35 [109,310ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:04:36 [109,611ms] [Error] [omni.physx.tensors.plugin] CUDA error: an illegal memory access was encountered: ../../../source/extensions/omni.physx.tensors/plugins/gpu/GpuArticulationView.cpp: 71
Error executing job with overrides: ['task=FrankaCabinet']
Traceback (most recent call last):
  File "scripts/rlgames_train.py", line 114, in parse_hydra_configs
    task = initialize_task(cfg_dict, env)
  File "/workspace/omniisaacgymenvs/omniisaacgymenvs/utils/task_util.py", line 77, in initialize_task
    env.set_task(task=task, sim_params=sim_config.get_physics_params(), backend="torch", init_sim=init_sim)
  File "/workspace/omniisaacgymenvs/omniisaacgymenvs/envs/vec_env_rlgames.py", line 51, in set_task
    super().set_task(task, backend, sim_params, init_sim)
  File "/isaac-sim/exts/omni.isaac.gym/omni/isaac/gym/vec_env/vec_env_base.py", line 94, in set_task
    self._world.reset()
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/world/world.py", line 282, in reset
    self._scene._finalize(self.physics_sim_view)
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/scenes/scene.py", line 290, in _finalize
    articulated_view.initialize(physics_sim_view)
  File "/workspace/omniisaacgymenvs/omniisaacgymenvs/robots/articulations/views/franka_view.py", line 28, in initialize
    super().initialize(physics_sim_view)
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/articulations/articulation_view.py", line 218, in initialize
    self._default_kps, self._default_kds = self.get_gains(clone=True)
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/articulations/articulation_view.py", line 1673, in get_gains
    kds[self._backend_utils.expand_dims(indices, 1), joint_indices], device=self._device
  File "/isaac-sim/exts/omni.isaac.core/omni/isaac/core/utils/torch/tensor.py", line 58, in move_data
    return data.to(device=device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
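The PyTorch message suggests rerunning with `CUDA_LAUNCH_BLOCKING=1`. That is a CUDA environment variable that makes kernel launches synchronous, so it has to be in the environment before the process creates its CUDA context; setting it on the child process's environment guarantees that. A minimal sketch, assuming the launcher and script paths from the logs above and a Hydra `num_envs=` override (check that your OmniIsaacGymEnvs version accepts it); `debug_launch` is a hypothetical helper name.

```python
import os
import shlex


# Hypothetical helper: builds the command and environment for rerunning the
# failing case with synchronous CUDA kernel launches.
# Assumptions: the launcher path, script path and the "task=" / "num_envs="
# Hydra overrides match your setup.
def debug_launch(task: str, num_envs: int,
                 python_sh: str = "/isaac-sim/python.sh",
                 script: str = "scripts/rlgames_train.py"):
    env = dict(os.environ)
    # Every kernel launch now blocks until it finishes, so the reported
    # error points at the call that actually faulted (slow; debugging only).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = [python_sh, script, f"task={task}", f"num_envs={num_envs}"]
    return cmd, env


cmd, env = debug_launch("FrankaCabinet", 12294)
print(shlex.join(cmd))
# prints: /isaac-sim/python.sh scripts/rlgames_train.py task=FrankaCabinet num_envs=12294
# To run it: subprocess.run(cmd, env=env, check=False)
```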
After switching to my own robot arm model, the maximum number of instances drops to 2048, with the following GPU usage:
[screenshot: GPU usage with the custom robot arm]
If I raise the number to 3072, the error below appears, even though GPU usage suggests there is still plenty of headroom. On the model side, I have already simplified the meshes (fewer triangles and vertices), but that brought only a marginal improvement.
2023-09-18 19:12:59 [20,725ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuEventRecord failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 53
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] PhysX error: SynchronizeStreams cuStreamWaitEvent failed with error 700
, FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/include/PxgCudaUtils.h, LINE 59
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] PhysX error: memcpy failed fail!
700, FILE /buildAgent/work/16dcef52b68a730f/source/gpunarrowphase/src/PxgNarrowphaseCore.cpp, LINE 2077
2023-09-18 19:12:59 [20,726ms] [Error] [omni.physx.plugin] Cuda context manager error, simulation will be stopped and new cuda context manager will be created.
2023-09-18 19:12:59 [20,743ms] [Warning] [omni.physx.plugin] PhysX warning: Failed to allocate pinned memory., FILE /buildAgent/work/16dcef52b68a730f/source/gpucommon/src/PxgCudaMemoryAllocator.cpp, LINE 58
/isaac-sim/python.sh: line 41: 1300 Segmentation fault (core dumped) $python_exe "$@" $args
There was an error running python
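Since the limit was found by trial and error (2048 works, 3072 fails), a bisection over `num_envs` narrows it down in a few runs. This is a sketch under two assumptions: `launches_ok` is a hypothetical predicate you supply (for example, a short headless run in a fresh process, since the crash above ends in a segmentation fault that kills the Python process), and failures are monotonic, i.e. if n instances fail, any larger count fails too.

```python
from typing import Callable


def max_working_envs(launches_ok: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the largest n in [lo, hi] for which launches_ok(n) is True.

    launches_ok is a hypothetical, user-supplied predicate; lo must succeed.
    """
    if not launches_ok(lo):
        raise ValueError(f"{lo} instances already fail")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if launches_ok(mid):
            lo = mid              # mid works: the answer is mid or above
        else:
            hi = mid - 1          # mid fails: the answer is below mid
    return lo


# Mock predicate standing in for real launches: pretend 2048 is the limit.
print(max_working_envs(lambda n: n <= 2048, 1024, 3072))  # prints 2048
```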
My goal is to simulate as many instances of my own robot as possible, but right now there seems to be a large gap between my model and the FrankaCabinet task.
Is there any parameter I need to change to get past this limit? Any suggestions would be greatly appreciated!
Best,
Chay