2 servers with the same configuration - only one complains about CUDA constraints

Hello,

I’m encountering an issue that I’m struggling to resolve.
I have two servers, let’s call them ServerA and ServerB for simplicity.
They both have nearly identical configurations…

Both servers are equipped with:

  • Rocky Linux version 8.7
  • Docker version: 24.0.7 (containerd: 1.6.26; runc: 1.1.10)
  • Nvidia Container Toolkit version: 1.14.3
  • Nvidia driver: reports CUDA 12.2

The key differences between the servers are:

  • ServerA is hosted on Azure, whereas ServerB operates in an air-gapped network.
  • ServerB runs on VMware with a GRID GPU and its corresponding driver (535.129.03, the one that came with the GRID bundle).
  • Both utilize an Nvidia A100 80GB GPU, so the hardware itself is the same (a quick version-comparison sketch follows this list).
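
For reference, this is roughly how I compared what each host reports (a minimal sketch, assuming nvidia-smi and the NVIDIA Container Toolkit CLI are on the PATH; nvidia-container-cli may need sudo):

    # Driver version and the CUDA version the driver advertises
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    nvidia-smi | grep "CUDA Version"

    # Version of the libnvidia-container CLI used by the runtime hook
    nvidia-container-cli --version

    # What libnvidia-container itself detects (driver version, CUDA version, GPUs)
    nvidia-container-cli info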

The container image I’m using was built from nvidia/cuda:12.3.
It’s a straightforward build, mainly adding some packages and setting a Python script as the entrypoint.
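
As far as I can tell, the cuda>=12.3 constraint comes from the base image itself (the nvidia/cuda images set an NVIDIA_REQUIRE_CUDA environment variable) rather than from anything I add. Here is a minimal sketch of how to confirm what the image carries; my-app:latest is just a placeholder for my build:

    # Show the env vars baked into the image and look for the CUDA requirement.
    # "my-app:latest" is a placeholder for the image built from nvidia/cuda:12.3.
    docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' my-app:latest | grep NVIDIA_REQUIRE_CUDA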

The issue arises when I run the same Docker command (docker run --gpus all) on both servers.
ServerA runs without any problems, but ServerB fails with a CUDA constraint error requiring CUDA version >= 12.3 (see the error below).
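
A minimal reproduction that takes my Python entrypoint out of the picture (again, my-app:latest is a placeholder for my image):

    # Works on ServerA; on ServerB it fails with the requirement error quoted below.
    docker run --rm --gpus all --entrypoint nvidia-smi my-app:latest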

Although setting the environment variable NVIDIA_DISABLE_REQUIRE=true on ServerB bypasses this issue, I’m keen to understand why
ServerB is behaving differently from ServerA.
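
For completeness, the workaround currently in place on ServerB looks roughly like this (a sketch; my-app:latest is a placeholder, and the variable simply tells libnvidia-container to skip its requirement checks):

    # Bypasses the cuda>=12.3 check performed by nvidia-container-cli.
    docker run --rm --gpus all -e NVIDIA_DISABLE_REQUIRE=true my-app:latest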

Thank you for any assistance


Error from ServerB:

Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.3, please update your driver to a newer version, or use an earlier cuda container: unknown.