VkResult: ERROR_OUT_OF_DEVICE_MEMORY

Hi,

We encountered an error after upgrading to the 2023 release:

2023-11-16 13:59:01 [284,910ms] [Error] [carb.graphics-vulkan.plugin] VkResult: ERROR_OUT_OF_DEVICE_MEMORY
2023-11-16 13:59:01 [284,910ms] [Error] [carb.graphics-vulkan.plugin] vkAllocateMemory failed for flags: 0.
2023-11-16 13:59:01 [284,910ms] [Error] [gpu.foundation.plugin] Unable to allocate buffer
2023-11-16 13:59:01 [284,910ms] [Error] [gpu.foundation.plugin] Buffer creation failed for the device: 0.
2023-11-16 13:59:01 [284,910ms] [Error] [gpu.foundation.plugin] Failed to update params for RenderOp 453
2023-11-16 13:59:01 [284,910ms] [Error] [gpu.foundation.plugin] Failed to update params for RenderOp Cached PT ClearAll. Will not execute this or subsequent RenderGraph operations. Aborting RenderGraph execution
2023-11-16 13:59:01 [284,910ms] [Error] [carb.scenerenderer-rtx.plugin] Failed to execute RenderGraph on device 0. Error Code: 7
2023-11-16 13:59:02 [285,034ms] [Error] [carb.graphics-vulkan.plugin] VkResult: ERROR_OUT_OF_DEVICE_MEMORY
2023-11-16 13:59:02 [285,034ms] [Error] [carb.graphics-vulkan.plugin] vkAllocateMemory failed for flags: 2.
2023-11-16 13:59:02 [285,034ms] [Error] [gpu.foundation.plugin] Texture creation failed for the device: 0.

Sometimes this warning also appears:

2023-11-16 14:12:58 [573,059ms] [Warning] [carb] Client omni.ui has acquired [carb::svg::Svg v0.1] 100 times. Consider accessing this interface with carb::getCachedInterface() (Performance warning)
2023-11-16 14:12:58 [573,060ms] [Warning] [carb] Client omni.ui has acquired [omni::kit::renderer::IRenderer v1.9] 100 times. Consider accessing this interface with carb::getCachedInterface() (Performance warning)

This happens when we run our workspace, which includes 5 cameras rendering and publishing ROS2 RGB + PCD.
With one camera it works, but with all five it crashes.

On the 2022 release it works without any problem.

Here is our setup: Ubuntu 22.04 with 64 GB of RAM
|---------------------------------------------------------------------------------------------|
| Driver Version: 525.85.05 | Graphics API: Vulkan
| GPU | Name | Active | LDA | GPU Memory | Vendor-ID | LUID |
|---------------------------------------------------------------------------------------------|
| 0 | NVIDIA GeForce RTX 4070 Ti | Yes: 0 | | 12528 MB | 10de | 0 |
|---------------------------------------------------------------------------------------------|
| 1 | Intel(R) Graphics (RPL-S) | | | 48048 MB | 8086 | 0 |

We will try a workaround by enabling/disabling the ROS2 publishers “on the fly”, but I’m wondering if there is an issue somewhere with the driver or anything else.

Thanks in advance and have a good day !

Hi @quentin.deyna

Were you able to resolve this issue? I’m encountering the same problem, so I’m curious if you have any updates on the error.

Thanks!

Hi @ksavevska - What Isaac Sim version are you using?

Sorry @ksavevska, I didn’t see your question :/

I haven’t exactly resolved the issue; I’m still pondering what it was referring to, but I’ve made some improvements.

I noticed I was running short of vRAM even with an RTX 4080. When running Isaac Sim with 3x RGB and depth ROS2 topics, consuming around 5 or 6 Gbits, alongside some greedy AI algorithms, the simulation crashes when the camera views are generated and published, especially when the other algorithms start up.

With some optimization and utilizing another PC to run the AI parts in parallel, the performance improved. I’ve also reduced the number of simulated cameras, which sufficed for testing purposes.

One thing I plan to try is publishing 5 rgbd camera viewports at periodic timestamps with a small offset between each, instead of publishing at every frame.
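
To make the staggered-publishing idea concrete, here is a minimal sketch in plain Python. It only decides which cameras should publish on a given frame; the function name `cameras_due` and its parameters are hypothetical and not part of Isaac Sim or ROS2, and hooking the result up to the actual publishers is left out.

```python
from typing import List


def cameras_due(frame_index: int, num_cameras: int, period: int) -> List[int]:
    """Return the indices of the cameras that should publish on this frame.

    Each camera publishes once every `period` frames. Camera i is shifted by
    an offset so the cameras are spread as evenly as possible over one period
    instead of all publishing on the same frame.
    (Hypothetical helper for illustration only.)
    """
    due = []
    for cam in range(num_cameras):
        # Evenly spaced offset in [0, period) for this camera.
        offset = (cam * period) // num_cameras
        if frame_index % period == offset:
            due.append(cam)
    return due


if __name__ == "__main__":
    # Print the schedule for 5 cameras publishing once every 10 frames.
    for frame in range(10):
        print(frame, cameras_due(frame, num_cameras=5, period=10))
```

With 5 cameras and a period of 10 frames, at most one camera publishes on any frame, so the per-frame cost of generating and publishing views is spread out rather than paid all at once. Whether this is enough to avoid the allocation failure is untested.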

Additionally, another pain point is that Rviz2 and rqt consume a lot of power and kill the fps, depending on what you’re logging.

Hope you’re not too stuck on this :/

Thank you, @quentin.deyna, for your response. I’ll try to optimize the code and observe the results.
However, when examining the system metrics, I noticed that no more than 40% of the vRAM is being utilized (during PPO learning with sb3 with a custom humanoid robot). Therefore I assume that the issue may not be due to a lack of vRAM.
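
For anyone who wants to check a figure like this from a script, below is a rough sketch that reads per-GPU memory through `nvidia-smi` and computes the used fraction. It assumes `nvidia-smi` is on the PATH; the function names are hypothetical. Note that a sampled reading can miss a short allocation spike, so a low number here does not by itself rule out a memory problem.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse the CSV text produced by QUERY into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {
                "index": int(index),
                "used_mib": int(used),
                "total_mib": int(total),
                "used_fraction": int(used) / int(total),
            }
        )
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Calling `query_gpu_memory()` repeatedly while the simulation starts up gives a coarse picture of how usage evolves just before the error appears.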

Hi @rthaker,

I am using 2023.1.1 in a docker container with a 535.86.05 driver.


@rthaker

I am having a similar vRAM issue in the exact same Docker container. My custom extension works just fine when running in the regular Isaac Sim application, but when I try doing exactly the same thing from within the Docker container, it reports VkResult: ERROR_OUT_OF_DEVICE_MEMORY even though I have only used 2GB of my 24GB available.

Any help would be greatly appreciated!

The issue seems to be caused by some error related to Optix, and the solution was to pass not only “--gpus all” to docker but also “--runtime=nvidia”. I’m curious as to why this is required: from reading the docs, it seems like specifying “--runtime=nvidia” is outdated, and in fact most GPU operations work without it. This feels like a niche issue.
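
As a sketch of the fix described above, the snippet below assembles a `docker run` command line that passes both flags. The image tag is an assumption based on the 2023.1.1 version mentioned earlier in the thread, the helper name `build_docker_run` is hypothetical, and any other options your setup needs (volumes, environment variables, entrypoint) are omitted.

```python
import shlex


def build_docker_run(image, use_nvidia_runtime=True):
    """Build a `docker run` argument list that exposes all GPUs.

    When `use_nvidia_runtime` is True, `--runtime=nvidia` is added alongside
    `--gpus all`, matching the combination reported to work in this thread.
    (Hypothetical helper for illustration only.)
    """
    cmd = ["docker", "run", "--rm", "-it", "--gpus", "all"]
    if use_nvidia_runtime:
        cmd.append("--runtime=nvidia")
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Assumed image tag; replace it with the image you actually use.
    print(shlex.join(build_docker_run("nvcr.io/nvidia/isaac-sim:2023.1.1")))
```

The script only prints the command so it can be inspected before running it by hand.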

@edward.schneeweiss Looking at the latest docs, using the runtime flag is still recommended:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html

Is it possible that you have a second, integrated GPU (like Intel)? In that case, specifying the runtime might be needed.

Hi everyone, I have the same problem using the official Docker image. Even after adding the --runtime=nvidia flag, the error remains.

Could you please create a new topic and include a link to this one? Thank you.