Expected behavior with NVIDIA GPU Operator and time-slicing on a server with multiple GPUs

Hi,

I have a single server with 4 identical GPUs connected to OpenShift. Multiple teams plan to deploy LLM-enabled apps on the server. We know that these apps will sit idle until a request comes in, so time-slicing seems to be the best configuration. The GPUs are A40s, so MIG is not possible.

We followed the NVIDIA GPU Operator docs ("Time-Slicing GPUs in Kubernetes"), setting renameByDefault to true and replicas to 10. With 4 GPUs at 10 replicas each, I can now see 40 nvidia.com/gpu.shared resources.
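
For reference, the time-slicing ConfigMap looks roughly like the sketch below. It is modeled on the example in the docs; the ConfigMap name `time-slicing-config-all` and the key `any` come from that example and are assumptions here, not necessarily what is deployed. Only renameByDefault and replicas reflect the values I described.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config-all   # assumed name, taken from the docs example
data:
  any: |-                         # assumed key, taken from the docs example
    version: v1
    flags:
      migStrategy: none           # A40 has no MIG support
    sharing:
      timeSlicing:
        renameByDefault: true     # advertise nvidia.com/gpu.shared instead of nvidia.com/gpu
        resources:
          - name: nvidia.com/gpu
            replicas: 10          # 4 physical GPUs x 10 = 40 shared resources
```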

I launched a pod from ‘nvidia/cuda:13.0.2-base-ubuntu24.04’ and requested 1 nvidia.com/gpu.shared. In the running container, NVIDIA_VISIBLE_DEVICES is set to ‘void’ and nvidia-smi reports only a single GPU. Is this expected behavior? I was expecting to see all 4 of the underlying GPUs.
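
A minimal sketch of the kind of pod spec I mean is below. The pod name, container name, and the sleep command are hypothetical placeholders; only the image and the nvidia.com/gpu.shared request come from my actual test.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-shared-test          # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda                  # hypothetical name
      image: nvidia/cuda:13.0.2-base-ubuntu24.04
      command: ["sleep", "infinity"]   # placeholder so the container stays up for nvidia-smi
      resources:
        limits:
          nvidia.com/gpu.shared: 1     # one time-sliced replica
```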

What I want to do at present is segment work. Some teams are prototyping, and I want to direct their work to gpu4; other teams have more production-ready work that I want to isolate on gpu1. I was hoping to use NVIDIA_VISIBLE_DEVICES to control where work is directed. Is there a way to do that?