Help analyzing GPU crash when using Isaac Sim 4.5?

Isaac Sim Version

4.5.0
4.2.0
4.1.0
4.0.0
2023.1.1
2023.1.0-hotfix.1
Other (please specify):

Operating System

Ubuntu 22.04
Ubuntu 20.04
Windows 11
Windows 10
Other (please specify):

GPU Information

  • Model: RTX 4090
  • Driver Version: 560.94

Topic Description

Detailed Description

I have an extension that runs in Isaac Sim 4.2, and I am trying to make it work with Isaac Sim 4.5.
After updating the code to accommodate the various breaking changes as shown in this post, I am now facing an issue where Isaac Sim keeps crashing.

I am not familiar with analyzing these logs, but the crash seems to have something to do with Vulkan and shaders.

[carb.graphics-vulkan.plugin] GPU crash is detected. Trying to write shader debug info into: C:/Users/mattm/.nvidia-omniverse/logs/Kit/Isaac-Sim Full/4.5/kit_20250207_130757-000002549af54320-00000254c6ada970.nvdbg
[carb.graphics-vulkan.plugin] Shader debug info successfully written
...
[carb.graphics-vulkan.plugin] VkResult: ERROR_DEVICE_LOST

See the full log
kit_20250207_130757.log (2.5 MB)

There are also many .nvdbg files, although I am not sure how to view them.

I was hoping someone who is more familiar with these lower-level GPU features could help with the diagnosis.

What are the likely causes of these GPU crashes?
What changes were made from 4.2 to 4.5 that would affect Vulkan or shaders?

What is also strange is that the crash occurs while the extension is idle. The error does not occur immediately after a user action or during an attempted operation, which would make the cause easier to find.

Event sequence:

  1. Open extension
  2. Click a button to load the scene
  3. The scene finishes loading
  4. Wait
  5. Crash!

I think the logs confirm this: the consecutive log entries at 127,957 ms and 160,565 ms are roughly 32.6 seconds apart.

2025-02-07 21:45:04 [127,954ms] [Info] [omni.ext.plugin] [ext: omni.isaac.ros2_bridge-3.0.2] started, startup time: 7 (ms)
2025-02-07 21:45:04 [127,957ms] [Info] [demo_collector.helpers.ros_client] Already connected to ROS Server at address: localhost:9090
2025-02-07 21:45:36 [160,565ms] [Error] [omni.kit.renderer.plugin] acquireNextFrameBufferNoWait: Failed to begin frame command list.
2025-02-07 21:45:36 [160,565ms] [Error] [gpu.foundation.plugin] A GPU crash occurred. Exiting the application...
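For anyone who wants to measure this idle gap without reading the whole log by hand, here is a minimal Python sketch. It assumes every Kit log line follows the "date time [elapsed ms] [Level] [source] message" layout seen in the excerpt above. The function names are made up for illustration and are not part of any Isaac Sim or Kit API. It also assumes the first [Error] line in the input is the crash, which may not hold for a full log that contains earlier, unrelated errors.

```python
import re

# Assumed layout of a Kit log line, based only on the excerpt in this post:
# 2025-02-07 21:45:36 [160,565ms] [Error] [omni.kit.renderer.plugin] message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<ms>[\d,]+)ms\] \[(?P<level>\w+)\] \[(?P<source>[^\]]+)\] (?P<msg>.*)$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "elapsed_ms": int(m["ms"].replace(",", "")),  # "160,565" -> 160565
        "level": m["level"],
        "source": m["source"],
        "msg": m["msg"],
    }


def idle_gap_before_first_error(lines):
    """Return (last entry before the error, first error entry, gap in ms).

    Returns None if there is no [Error] line or nothing was logged before it.
    """
    previous = None
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip blank or continuation lines
        if entry["level"] == "Error":
            if previous is None:
                return None
            return previous, entry, entry["elapsed_ms"] - previous["elapsed_ms"]
        previous = entry
    return None


# The four lines quoted above, used as sample input.
SAMPLE = """
2025-02-07 21:45:04 [127,954ms] [Info] [omni.ext.plugin] [ext: omni.isaac.ros2_bridge-3.0.2] started, startup time: 7 (ms)
2025-02-07 21:45:04 [127,957ms] [Info] [demo_collector.helpers.ros_client] Already connected to ROS Server at address: localhost:9090
2025-02-07 21:45:36 [160,565ms] [Error] [omni.kit.renderer.plugin] acquireNextFrameBufferNoWait: Failed to begin frame command list.
2025-02-07 21:45:36 [160,565ms] [Error] [gpu.foundation.plugin] A GPU crash occurred. Exiting the application...
"""

result = idle_gap_before_first_error(SAMPLE.splitlines())
last_ok, first_err, gap_ms = result
print(f"Last entry before crash: [{last_ok['source']}] at {last_ok['elapsed_ms']} ms")
print(f"First error: [{first_err['source']}] at {first_err['elapsed_ms']} ms")
print(f"Idle gap: {gap_ms} ms")
```

To run it on a full log, pass the lines of the file to the same function, for example open(path, encoding="utf-8") and iterate over it.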

Here is a second log reproducing the error, if it helps.

kit_20250207_134256.log (2.5 MB)

Do you still encounter the issue after relaunching Isaac Sim and following the same sequence?

Yes

Although, I am not sure I understand the question as the answer seemed implicit from my first two posts.
Given that Isaac Sim crashes, the only way I could have reproduced it a second time and provided the second crash log was by relaunching.

I have reproduced it a third time:

2025-02-11 17:17:11 [84,802ms] [Info] [omni.ext.plugin] [ext: omni.isaac.ros2_bridge-3.0.2] started, startup time: 6 (ms)
2025-02-11 17:17:11 [84,805ms] [Info] [demo_collector.helpers.ros_client] Already connected to ROS Server at address: localhost:9090
2025-02-11 17:17:40 [113,675ms] [Error] [carb.graphics-vulkan.plugin] GPU crash is detected. Trying to write shader debug info into: C:/Users/mattm/.nvidia-omniverse/logs/Kit/Isaac-Sim Full/4.5/kit_20250211_091546-bba7fda86fcabf05-0000014b8d6eb7c0.nvdbg
2025-02-11 17:17:40 [113,676ms] [Error] [carb.graphics-vulkan.plugin] Shader debug info successfully written
2025-02-11 17:17:40 [113,678ms] [Error] [carb.graphics-vulkan.plugin] GPU crash is detected. Trying to write shader debug info into: C:/Users/mattm/.nvidia-omniverse/logs/Kit/Isaac-Sim Full/4.5/kit_20250211_091546-0000014c1c0be830-0000014bdfdae540.nvdbg
2025-02-11 17:17:40 [113,679ms] [Error] [carb.graphics-vulkan.plugin] Shader debug info successfully written

kit_20250211_091546.log (2.5 MB)

I notice there are slight variations in the logs each time the crash occurs. They are all GPU crashes, but some have more details than others. For example, in yet another reproduction, I see:

2025-02-11 17:25:58 [177,097ms] [Info] [omni.ext.plugin] [ext: omni.isaac.ros2_bridge-3.0.2] started, startup time: 3 (ms)
2025-02-11 17:25:58 [177,100ms] [Info] [demo_collector.helpers.ros_client] Already connected to ROS Server at address: localhost:9090
2025-02-11 17:26:12 [191,037ms] [Error] [omni.kit.renderer.plugin] Failed to begin render graph.
2025-02-11 17:26:12 [191,048ms] [Error] [omni.kit.renderer.plugin] acquireNextFrameBufferNoWait: Failed to begin frame command list.
2025-02-11 17:26:12 [191,048ms] [Error] [gpu.foundation.plugin] A GPU crash occurred. Exiting the application...
Reasons for the failure: a device lost, out of memory, or an unexpected bug.
2025-02-11 17:26:12 [191,049ms] [Warning] [gpu.foundation.plugin] polling aftermath dump status ...

This shows “Failed to begin render graph”, which reminds me of my other post about the Hydra error “HydraEngine::render failed to end the compute graph: error code 6”.

Later in the logs, instead of VkResult: ERROR_DEVICE_LOST, I see a memory access error: GPU pagefault occured on virtual address(0x000000000e610000)
(I would guess this is an attempt to access memory that does not exist?)

2025-02-11 17:26:12 [191,092ms] [Warning] [carb.graphics-vulkan.plugin] GPU pagefault occured on virtual address(0x000000000e610000)
2025-02-11 17:26:12 [191,092ms] [Warning] [carb.graphics-vulkan.plugin] GPU pagefault no associated data found closest allocations were:
2025-02-11 17:26:12 [191,092ms] [Warning] [carb.graphics-vulkan.plugin] 1: AddressRange(0x0000000000000000 - 0x0000000000000000) with name(dummy to ignore)
2025-02-11 17:26:12 [191,092ms] [Warning] [carb.graphics-vulkan.plugin] 2: AddressRange(0xffffffffffffffff - 0xffffffffffffffff) with name(dummy to ignore)
2025-02-11 17:26:12 [191,099ms] [Warning] [carb.graphics-vulkan.plugin] failed to find shader info with aftermath hash 0
2025-02-11 17:26:12 [191,102ms] [Warning] [carb.graphics-vulkan.plugin] failed to find shader info with aftermath hash 0

kit_20250211_092301.log (2.5 MB)
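Since the failure looks slightly different from run to run, a small script that counts the crash signatures in each log may make the comparison easier. This is only a sketch in Python: the signature strings are copied from the excerpts in this thread (including the "occured" spelling that the plugin itself emits), the category names are my own labels, and nothing here uses an Isaac Sim or Kit API.

```python
import re

# Patterns copied from the log excerpts in this thread. The dict keys are
# made-up labels used only for this comparison.
SIGNATURES = {
    "device_lost": re.compile(r"VkResult: ERROR_DEVICE_LOST"),
    "page_fault": re.compile(
        r"GPU pagefault occured on virtual address\((0x[0-9a-fA-F]+)\)"
    ),
    "render_graph": re.compile(r"Failed to begin render graph"),
    "frame_command_list": re.compile(r"Failed to begin frame command list"),
    "shader_dump": re.compile(r"Trying to write shader debug info into: (.+\.nvdbg)"),
}


def summarize(log_text):
    """Count each signature and collect fault addresses and .nvdbg paths."""
    counts = {name: 0 for name in SIGNATURES}
    details = {"page_fault": [], "shader_dump": []}
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            m = pattern.search(line)
            if m is None:
                continue
            counts[name] += 1
            if name in details:
                details[name].append(m.group(1))  # address or dump path
    return counts, details


# Lines quoted from the fourth reproduction above, used as sample input.
SAMPLE = """
2025-02-11 17:26:12 [191,037ms] [Error] [omni.kit.renderer.plugin] Failed to begin render graph.
2025-02-11 17:26:12 [191,048ms] [Error] [omni.kit.renderer.plugin] acquireNextFrameBufferNoWait: Failed to begin frame command list.
2025-02-11 17:26:12 [191,048ms] [Error] [gpu.foundation.plugin] A GPU crash occurred. Exiting the application...
2025-02-11 17:26:12 [191,092ms] [Warning] [carb.graphics-vulkan.plugin] GPU pagefault occured on virtual address(0x000000000e610000)
2025-02-11 17:26:12 [191,092ms] [Warning] [carb.graphics-vulkan.plugin] GPU pagefault no associated data found closest allocations were:
2025-02-11 17:26:12 [191,099ms] [Warning] [carb.graphics-vulkan.plugin] failed to find shader info with aftermath hash 0
"""

counts, details = summarize(SAMPLE)
for name, count in counts.items():
    print(f"{name}: {count}")
print("page fault addresses:", details["page_fault"])
```

Running summarize on the text of each attached log would show at a glance which runs end in ERROR_DEVICE_LOST and which end in a page fault, and the same function could be applied to a log from someone else for comparison.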

We encounter the same issue in 4.5 but not 4.2.

What is your GPU, driver version, and VRAM?

|---------------------------------------------------------------------------------------------|
| Driver Version: 566.36        | Graphics API: D3D12
|=============================================================================================|
| GPU | Name                             | Active | LDA | GPU Memory | Vendor-ID | LUID       |
|     |                                  |        |     |            | Device-ID | UUID       |
|     |                                  |        |     |            | Bus-ID    |            |
|---------------------------------------------------------------------------------------------|
| 0   | NVIDIA GeForce RTX 4090          | Yes: 0 |     | 24142   MB | 10de      | 6c3e0100.. |
|     |                                  |        |     |            | 2684      | 0          |
|     |                                  |        |     |            | 1         |            |
|---------------------------------------------------------------------------------------------|
| 1   | AMD Radeon(TM) Graphics          |        |     | 485     MB | 1002      | c34c0100.. |
|     |                                  |        |     |            | 164e      | 0          |
|     |                                  |        |     |            | N/A       |            |
|=============================================================================================|
| OS: Windows 11 Pro, Version: 10.0 (23H2), Build: 22631, Kernel: 10.0.22621.4746
| Processor: AMD Ryzen 9 7950X 16-Core Processor             | Cores: 16 | Logical: 32
|---------------------------------------------------------------------------------------------|
| Total Memory (MB): 64629 | Free Memory: 49484
| Total Page/Swap (MB): 128877 | Free Page/Swap: 109493
|---------------------------------------------------------------------------------------------|

Please try rolling your driver back to the version recommended in the documentation and see if that helps.

I updated to 572.xx and this did not improve the situation. Still crashing.

You are upgrading instead of downgrading.

OK, I will downgrade to 537.58.

But this driver is very old (10/2023). Is this a serious recommendation?

After downgrading to 537.58, Isaac Sim 4.5 still crashes.
4.2 works fine with all driver versions.

Thanks for confirming the issue, Lars!
Also, now that this post has more attention from moderators, I would like to point back to the questions in the OP. I think answering them may lead to a more targeted solution than the tedious process of reproducing the crash under different conditions.

Given that Isaac Sim 4.2 and 4.5 use the same NVIDIA GPU driver and 4.2 works, wouldn’t that mean the driver is NOT the problem? Or are you implying that 4.5 uses different features of the driver, which would implicitly answer my second question?

After I downgraded the driver to the one listed in the documentation (537.x), it still failed. I then updated to the most current driver (572.x) and restarted the machine. Since then, it has not crashed on my side, which is weird.

The difference is that I don’t use the ROS bridge; I use the Replicator SDK for capturing point clouds. But I had the same issue in the first place.

Could you share your steps to reproduce the issue by using the Replicator SDK for capturing point clouds? Can it be reproduced by running a standalone example application?

I investigated my issue further:
It was caused by an unnecessary switch of the viewport, done in order to capture the point cloud via Replicator from the active viewport. Since I removed the need for that switch, Isaac Sim 4.5 no longer crashes for me.

Which specific extension are you opening? Are you using the “Isaac Example” extension and then loading one of the sample scenes?

No, I am using a custom extension that worked in Isaac Sim 4.2.
You can see it in the logs, where it is called demo_collector.

I wanted to clear up some statements on this thread.

Lars said “same” so I assumed it was a similar [carb.graphics-vulkan.plugin] GPU crash is detected; however, he said his crash was avoided by “[removing an] unnecessary switch of the viewport to capture the point cloud via replicator from the active viewport”

I am not switching viewports or using Replicator, and yet I still have the GPU crash.

This means that similar GPU crashes are being caused by different conditions, and thus we have different issues. The crash is still blocking our use of Isaac Sim 4.5.

It might help if @lars.beier could share one of his crash logs so we could compare it with mine and perhaps find more clues by noticing differences in the sequence of operations. For example, perhaps Lars’s logs do not say VkResult: ERROR_DEVICE_LOST.

I think this thread should remain open until we get answers to the original questions and know why the crashes occur.
If the answers cannot be given due to a lack of knowledge or capability, that is understandable given how low-level this is, but at least we would know not to expect them.

Even if Lars found a way to prevent the crash from happening, that doesn’t explain why the crash is happening. Ideally, Isaac Sim should either prevent invalid operations it does not support or fail more gracefully and crash only the viewport component.

Have you tried using the driver version recommended for Windows in Isaac Sim Requirements — Isaac Sim Documentation?