Isaac Sim Crashes on EC2 g6.12xlarge with 4 GPUs (2 GPUs work)

Isaac Sim Version:

4.2.0
Works with Isaac Sim Version 4.0.0

Operating System:

Ubuntu 22.04

GPU Information

  • Model: NVIDIA L4
  • Driver Version: 565.57.01

Topic Description

I am trying to run an SDG scene on an EC2 instance (g6.12xlarge). With 2 GPUs, everything works and SDG functions as expected. With 4 GPUs, Isaac Sim crashes. The same 4-GPU setup worked with Isaac Sim 4.0.0.

Error Messages

[74.025s] app ready
2024-12-03 16:27:38 [74,478ms] [Warning] [omni.kit.imgui_renderer.plugin] _createExtendCursor: No windowing.
2024-12-03 16:27:38 [74,478ms] [Warning] [omni.kit.imgui_renderer.plugin] _createExtendCursor: No windowing.
[75.632s] Simulation App Startup Complete
2024-12-03 16:27:53 [89,044ms] [Warning] [omni.hydra.scene_delegate.plugin] Calling getBypassRenderSkelMeshProcessing for prim /World/TableOutput.proto_collider_leg_01_id3 that has not been populated
2024-12-03 16:28:27 [122,836ms] [Warning] [omni.syntheticdata.plugin] OgnSdPostRenderVarToHost : rendervar copy from texture directly to host buffer is counter-performant. Please use copy from texture to device buffer first.
2024-12-03 16:28:31 [127,207ms] [Error] [carb.graphics-vulkan.plugin] GPU crash is detected. Shader debug is written into: /home/ubuntu/.local/share/ov/pkg/isaac-sim-4.2.0/kit/logs/Kit/Isaac-Sim/4.2/kit_20241203_162624-000072618313b250-0000724a819cdf90.nvdbg
2024-12-03 16:28:31 [127,209ms] [Error] [carb.graphics-vulkan.plugin] GPU crash is detected. Crash dump is written into: /home/ubuntu/.local/share/ov/pkg/isaac-sim-4.2.0/kit/logs/Kit/Isaac-Sim/4.2/kit_20241203_162624-0.nv-gpudmp
2024-12-03 16:28:31 [127,209ms] [Error] [carb.graphics-vulkan.plugin] GPU crash dump is successfully written
2024-12-03 16:29:27 [182,807ms] [Fatal] [rtx.scenedb.plugin] Waiting on Semaphore 6 for longer than 60s: Failure to complete CopyCommandList: Copy Context Geometry copy engine command list command list
2024-12-03 16:29:27 [182,899ms] [Fatal] [rtx.scenedb.plugin] Waiting on Semaphore 8 for longer than 60s: Failure to complete CopyCommandList: Copy Context Geometry copy engine command list command list
2024-12-03 16:29:27 [182,900ms] [Fatal] [rtx.scenedb.plugin] Waiting on Semaphore 9 for longer than 60s: Failure to complete CopyCommandList: Copy Context Geometry copy engine command list command list
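
To pull only the Error and Fatal entries out of a long Kit log like the one above, a small filter can help. This is a minimal sketch that assumes every timestamped line follows the `date time [ms] [Level] [source] message` layout seen in this excerpt; the names `LogEntry` and `severe_entries` are illustrative and not part of Isaac Sim.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumed line layout, taken from the excerpt above:
# 2024-12-03 16:28:31 [127,207ms] [Error] [carb.graphics-vulkan.plugin] GPU crash is detected. ...
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"\[(?P<ms>[\d,]+)ms\] \[(?P<level>\w+)\] \[(?P<source>[^\]]+)\] (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    elapsed_ms: int
    level: str
    source: str
    message: str


def severe_entries(lines: Iterable[str], levels=("Error", "Fatal")) -> List[LogEntry]:
    """Return the entries whose level is in `levels`, in file order."""
    entries = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Lines such as "[74.025s] app ready" do not match and are skipped.
        if match and match.group("level") in levels:
            entries.append(
                LogEntry(
                    elapsed_ms=int(match.group("ms").replace(",", "")),
                    level=match.group("level"),
                    source=match.group("source"),
                    message=match.group("message"),
                )
            )
    return entries
```

Calling `severe_entries(open(path))` on a plaintext log would then list the GPU crash and semaphore timeout lines without the surrounding warnings.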

Additional Information

|---------------------------------------------------------------------------------------------|
| Driver Version: 565.57.01 | Graphics API: Vulkan
|=============================================================================================|
| GPU | Name      | Active | LDA | GPU Memory | Vendor-ID | Device-ID | Bus-ID | LUID | UUID      |
|---------------------------------------------------------------------------------------------|
| 0   | NVIDIA L4 | Yes: 0 |     | 23034 MB   | 10de      | 27b8      | 38     | 0    | c3475aba… |
| 1   | NVIDIA L4 | Yes: 1 |     | 23034 MB   | 10de      | 27b8      | 3a     | 0    | 10989674… |
| 2   | NVIDIA L4 |        |     | 23034 MB   | 10de      | 27b8      | 3c     | 0    | 379e3e69… |
| 3   | NVIDIA L4 |        |     | 23034 MB   | 10de      | 27b8      | 3e     | 0    | 4dad17a1… |
|=============================================================================================|
| OS: 22.04.5 LTS (Jammy Jellyfish) ubuntu, Version: 22.04.5, Kernel: 6.8.0-1019-aws
| Processor: AMD EPYC 7R13 Processor | Cores: 24 | Logical: 48
|---------------------------------------------------------------------------------------------|
| Total Memory (MB): 186124 | Free Memory: 164767
| Total Page/Swap (MB): 0 | Free Page/Swap: 0
|---------------------------------------------------------------------------------------------|

Could you provide the dump files?

Here you are:
kit_20241205_140539-0.zip (1.6 MB)

Is this issue specific to a certain scene? Can it be reproduced using the included examples?

We tried the same configuration (4 GPUs) on a different scene with the same crashing result.

In general, version 565 drivers are in beta and are not recommended for use (see Isaac Sim Requirements — Omniverse IsaacSim). Additionally, L4 has not been fully tested with Isaac Sim and may require special drivers. It is likely that installing the correct drivers on AWS will resolve the issue.

Could you please provide simple reproduction steps?

We have also tried the 535 driver, as well as L4, T4, and A10G GPUs, and got the same result each time.
We are going to try running your scene from this tutorial and will get back to you with the reproduction steps.

Edit: added dump files
dump_files.zip (1.7 MB)

4 GPUs, L4, Driver 550, same results

Reproduction steps:

  • Launch AWS instance g6.12xlarge, AMI ID - ami-0745b7d4092315796.
  • Install the NVIDIA 550 driver using: apt install nvidia-driver-550 libnvidia-common-550
  • Install Isaac Sim 4.2 using the Omniverse launcher on Ubuntu.
  • Run the SDG scene: ./python.sh standalone_examples/replicator/scene_based_sdg/scene_based_sdg.py

Python dependencies:

open3d==0.18.0
pyyaml==6.0.1
numpy==1.23.0
py-trees==2.2.3
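
Before running the scene, it can be worth confirming that the instance really exposes the expected number of GPUs and the intended driver branch. The following is a rough sketch, not an official check: it shells out to `nvidia-smi` with its CSV query output, and the helper names (`query_gpus`, `parse_gpu_rows`, `check_gpus`) are made up for illustration.

```python
import subprocess
from typing import List, Tuple

GpuRow = Tuple[int, str, str]  # (index, name, driver_version)

QUERY = ["nvidia-smi", "--query-gpu=index,name,driver_version", "--format=csv,noheader"]


def query_gpus() -> str:
    """Run nvidia-smi and return its raw CSV output (one GPU per line)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_gpu_rows(text: str) -> List[GpuRow]:
    """Parse lines such as '0, NVIDIA L4, 550.127.08' into tuples."""
    rows = []
    for line in text.strip().splitlines():
        index, name, driver = [field.strip() for field in line.split(",")]
        rows.append((int(index), name, driver))
    return rows


def check_gpus(rows: List[GpuRow], expected_count: int, driver_prefix: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} GPUs, found {len(rows)}")
    for index, name, driver in rows:
        if not driver.startswith(driver_prefix):
            problems.append(f"GPU {index} ({name}) reports driver {driver}, expected {driver_prefix}*")
    return problems
```

For the setup described in these steps, `check_gpus(parse_gpu_rows(query_gpus()), 4, "550.")` should return an empty list on a correctly provisioned g6.12xlarge.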

Hi @booger.bubble1. Thanks for the repro steps. Can you provide the plaintext logfiles instead please?

Also, how did you set up the EC2 instance? Are you using our AMI or your own? It seems like you are running natively using Omniverse Launcher. Usually, we run Isaac Sim in a headless container on EC2 for SDG workflows.

Please note that we do not support the apt install method for installing drivers. Please install using the .run installer.
Driver version 550.127.08 is recommended for L4 in data centers.

Has this been resolved? I am experiencing a similar issue on Isaac Sim 4.5.0 and Ubuntu 24.04.

@booger.bubble1 do you have any updates on this issue?

@Turowicz @VickNV
No progress on using multiple GPUs with Isaac Sim; it doesn’t seem to work as designed. No matter what we tried or which AWS instance configuration we used, things crashed and the GPUs were poorly utilized.

Our workaround is to prepare multiple configuration files for our Replicator runs and pass them to a workflow orchestration tool, which groups them according to the number of GPUs and then launches one config per GPU per group.
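
As a rough illustration of that grouping idea (not our actual orchestration tool), the sketch below splits a list of config files into groups of at most `num_gpus` and starts one process per GPU within each group. The names `plan_groups`, `run_groups`, and `build_command` are hypothetical, and how each process gets pinned to its GPU is deliberately left to the caller-supplied `build_command`, since that depends on the setup.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

Job = Tuple[int, str]  # (gpu_index, config_path)


def plan_groups(configs: Sequence[str], num_gpus: int) -> List[List[Job]]:
    """Split configs into groups of at most num_gpus, assigning GPU indices 0..num_gpus-1."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    groups = []
    for start in range(0, len(configs), num_gpus):
        chunk = configs[start:start + num_gpus]
        groups.append([(gpu, cfg) for gpu, cfg in enumerate(chunk)])
    return groups


def run_groups(
    groups: List[List[Job]],
    build_command: Callable[[int, str], List[str]],
) -> List[int]:
    """Run groups one after another; inside a group, run one process per GPU in parallel.

    build_command(gpu_index, config_path) is a hypothetical hook that returns the
    command line for a single-GPU run, including whatever GPU pinning applies.
    """
    return_codes = []
    for group in groups:
        procs = [subprocess.Popen(build_command(gpu, cfg)) for gpu, cfg in group]
        # Wait for the whole group before starting the next one.
        return_codes.extend(proc.wait() for proc in procs)
    return return_codes
```

With five configs and two GPUs, `plan_groups` yields three groups: two full ones and a final group that only uses GPU 0. Each process is an independent single-GPU run, so a crash in one does not take down the others.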

Unfortunately, that was the only approach that worked. Attempting to use mGPU with Isaac Sim left many battle scars and wasted a lot of engineering time. I am not sure we will ever attempt it again, since our workflow works well.

I am getting this error with a single GPU, though.

Thank you for sharing your experience. To help us investigate the multi-GPU crashes, could you please take a look at Sheikh_Dawood’s earlier post above and provide the requested details?