Hello, I have an issue to report.
We’re running on an NVIDIA GeForce RTX 5090.
- NVIDIA-SMI 570.153.02
- Driver Version: 570.153.02
- CUDA Version: 12.8
We run our project in Docker containers with GPU support. When we start 4 containers at the same time, the following error occurs:
```
CUDA error: CUDA-capable device(s) is/are busy or unavailable
Search for `cudaErrorDevicesUnavailable' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
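For context, `cudaErrorDevicesUnavailable` is commonly raised when the GPU's compute mode is set to `Exclusive_Process`, which allows only one CUDA context at a time; with several containers initializing CUDA concurrently, all but one would fail. A sketch of how this could be checked on the host (assuming `nvidia-smi` from the driver package is on the PATH):

```shell
# Query the current compute mode of each GPU.
# "Default" allows multiple processes to share the GPU;
# "Exclusive_Process" permits only one CUDA context at a time.
nvidia-smi --query-gpu=compute_mode --format=csv

# If it reports Exclusive_Process, resetting to the shared default
# (requires root) may resolve the multi-container failures:
# sudo nvidia-smi -c DEFAULT
```

This is only a diagnostic sketch, not a confirmed cause; the error can also appear when GPU memory or contexts are exhausted by the other containers.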
The error occurs only intermittently, but we'd like to understand the root cause before it affects production. Any help would be appreciated!