I have 8 GPUs (RTX 4090) installed on my server (Ubuntu 22.04, NVIDIA driver 550, CUDA 12.4). After running LLM inference with vLLM for some time, I ran into a problem: when I tried to check the status of the GPUs with nvidia-smi, I received the following error:

Unable to determine the device handle for GPU0000:76:00.0: Unknown Error

I tried to resolve it by rebooting the machine, but that didn’t work. After the reboot, nvidia-smi showed only 7 GPUs, and I found the following error in the kernel logs:

NVRM: GPU 0000:76:00.0: RmInitAdapter failed! (0x31:0x40:2628)
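
In case it helps anyone compare against their own logs, here is a minimal sketch of how the failing PCI address and the three code fields could be pulled out of kernel log text. It assumes the lines follow exactly the format quoted above. The function name and the field names `f1`, `f2`, `f3` are placeholders I made up, since I don't know what the fields actually mean.

```python
import re

# Matches the NVRM line quoted above. The group names f1/f2/f3 are
# placeholders: the meaning of the three fields is unknown to me.
NVRM_RE = re.compile(
    r"NVRM: GPU (?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! "
    r"\((?P<f1>0x[0-9a-fA-F]+):(?P<f2>0x[0-9a-fA-F]+):(?P<f3>\d+)\)"
)


def parse_nvrm_failures(log_text):
    """Return one dict per RmInitAdapter failure found in log_text."""
    return [m.groupdict() for m in NVRM_RE.finditer(log_text)]


if __name__ == "__main__":
    sample = "NVRM: GPU 0000:76:00.0: RmInitAdapter failed! (0x31:0x40:2628)"
    for hit in parse_nvrm_failures(sample):
        print(hit)
```

Feeding it the output of `dmesg` or `journalctl -k` would show whether the same bus address and the same codes come back after every reboot.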

My questions are: what does the error code 0x31:0x40:2628 mean, and where should I start troubleshooting this issue?

Any help would be much appreciated 😀

