cudaErrorDevicesUnavailable error occurred... help!

Hello, I have an exciting issue to share!

We’re running on an NVIDIA GeForce RTX 5090.

  • NVIDIA-SMI 570.153.02
  • Driver Version: 570.153.02
  • CUDA Version: 12.8

We’re using a Docker container with GPU support for our project.

When we run 4 containers together, the following error occurs:

CUDA error: CUDA-capable device(s) is/are busy or unavailable
Search for `cudaErrorDevicesUnavailable' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The error occurs intermittently, but we’re concerned. Please help!
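
In case it helps anyone reproduce this, here is a minimal sketch of how the debugging hint from the log above could be applied. It assumes the workload is a Python/PyTorch process (the TORCH_USE_CUDA_DSA line suggests PyTorch), and the helper name enable_cuda_launch_blocking is made up for illustration. The variable has to be set before CUDA is initialized, so the safest place is before torch is imported. This only makes the reported stack trace more reliable; it is not a fix for the error itself.

```python
import os


def enable_cuda_launch_blocking(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given environment mapping.

    Hypothetical helper for illustration. Defaults to os.environ, so it
    affects the current process and any child processes it starts.
    """
    if env is None:
        env = os.environ
    # Kernel launches become synchronous, so a CUDA error is reported at
    # the call that caused it rather than at a later, unrelated API call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this before importing torch (or anything else that initializes CUDA).
enable_cuda_launch_blocking()
# import torch  # assumption: the real workload imports PyTorch after this point
```

The same variable can also be passed from outside the container with docker run -e CUDA_LAUNCH_BLOCKING=1, which avoids touching the code.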

Hi, were you able to figure out a solution for this? We’re running into something pretty similar now.