Cloud Native Core: CUDA Validator Container Fails

I’m trying to install Cloud Native Core v6.2 on an Ubuntu 20.04.3 LTS machine with an RTX 3060 Ti. The installed NVIDIA GPU driver version is 470.86, and the CUDA version is 11.4. The container toolkit appears to be 1.10.0.

I’ve made use of the available Ansible playbook, leaving the default values in place apart from the hosts file, as instructed.

I can see most of the Kubernetes pods come up, but a few do not, the most concerning of which is the CUDA validator pod.

I managed to get this out of the logs:

Failed to allocate device vector A (error code forward compatibility was attempted on non supported HW)!
[Vector addition of 50000 elements]

I’m not sure if there’s a version mismatch somewhere, but as I ran an NVIDIA Cloud Native Core Ansible playbook, I’d expect this to be taken care of.

Can anyone help? Thanks in advance!

Cloud Native Core is intended to be used on NVIDIA-Certified Systems.

Such systems will consist of OEM servers and other NVIDIA datacenter hardware. The RTX 3060Ti is not a datacenter GPU.

The proximate issue here is that the container you loaded (in the CUDA validator pod) expects a newer GPU driver than 470.86. The container includes CUDA compatibility libraries intended to address this situation, but that kind of forward compatibility is not supported on GeForce hardware and won’t work there.

You might have better luck if you install the latest driver for your GPU (515.48.07 or newer; see here for 6.2), but I probably won’t be able to respond to further questions here, as this is essentially an unsupported configuration.

(Yes, the GPU operator can update the GPU driver, but it is also not supported on GeForce hardware.)
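For anyone who wants to check the driver-version mismatch described above before rerunning the playbook, here is a minimal sketch in Python. It assumes that `nvidia-smi` is on the PATH and that 515.48.07 (the version mentioned above) is the minimum to compare against. The function names are hypothetical helpers for illustration and are not part of any NVIDIA tooling.

```python
import subprocess
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '515.48.07' into (515, 48, 7)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_at_least(installed: str, required: str) -> bool:
    """Return True if the installed driver is at least the required one.

    Versions are compared numerically, component by component.
    """
    return parse_version(installed) >= parse_version(required)


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU.

    Assumes nvidia-smi is installed; raises CalledProcessError otherwise.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    # One line is printed per GPU; take the first.
    return result.stdout.splitlines()[0].strip()
```

With the 470.86 driver from the original post, `driver_at_least("470.86", "515.48.07")` returns `False`. On a live system, `driver_at_least(installed_driver_version(), "515.48.07")` performs the same check against whatever driver is actually loaded.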

Thank you kindly for the response. I will try your suggestion come Monday but do understand you’ll be unable to support further.

I can confirm that updating to the latest production-ready 515.x driver has fixed the issue. Thank you.

The only issue I have now is that the nvidia-driver-daemonset pod is stuck on Init (the nvidia-driver-ctr container seems to be stuck initializing and the logs say nothing else), but I understand that I’m on my own to figure out how to troubleshoot this.

Hopefully this thread helps someone else in the future.

Thanks again.
