Generically test driver deployment

Hello, :)

I am one of the developers of the BlueBanquise stack (GitHub: bluebanquise/bluebanquise, a coherent Ansible roles collection to simply deploy clusters of nodes).
Basically, the stack provisions Linux-based clusters and then specializes them (High Performance Computing, rendering farms, etc.).

We want to integrate native Nvidia support in our next release (2.0), but most of our current users run expensive hardware (A100, DGX, etc.), and we cannot afford to test on real enterprise hardware ourselves.

My question is: if we install the drivers from an Ansible role, without any Nvidia hardware on the system, is there a way to validate that the installation worked, and so validate the role?
Would an “lsmod | grep -i nvidia” be an acceptable test? Or can the nvidia-smi command communicate with a loaded driver without hardware and just answer “no cards found” (which would also be OK, since it does not answer “cannot communicate with driver”)?

With my best regards

Ox

If no Nvidia hardware is installed, the kernel modules won’t load, so you can’t check for them. Instead, check for the specific error message they produce when you try to force-load them.
Also, please note that on SXM systems like DGX, you’ll need to use the Tesla driver release, since the fabric-manager needs to be installed and started.
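
As a rough illustration of the error-message approach, here is a minimal shell sketch. It assumes that modprobe prints “No such device” when the module is installed but finds no GPU, and “not found in directory” when the module is missing from /lib/modules. Both strings are assumptions based on typical kmod output and should be verified on your distribution and driver version. The function names and status words are hypothetical.

```shell
# Classify the error text of a failed `modprobe nvidia` attempt.
# The matched strings are ASSUMPTIONS about typical kmod output;
# confirm them on your own distribution and driver version.
classify_modprobe_error() {
    case "$1" in
        *"No such device"*)
            # Module file exists, but loading failed because no GPU was found.
            echo "installed-no-hardware" ;;
        *"not found in directory"*)
            # kmod could not locate the module under /lib/modules.
            echo "not-installed" ;;
        *)
            echo "unknown" ;;
    esac
}

# Hypothetical wrapper that prints one status word for the nvidia module.
# Needs root, because it actually attempts to load the module.
check_nvidia_install() {
    # modinfo only reads the module file on disk, so it works without a GPU.
    if ! modinfo nvidia >/dev/null 2>&1; then
        echo "not-installed"
        return 0
    fi
    if err=$(modprobe nvidia 2>&1); then
        echo "loaded"
    else
        classify_modprobe_error "$err"
    fi
}
```

On a machine without a GPU, a role test could call check_nvidia_install and treat “installed-no-hardware” as a pass, and “not-installed” or “unknown” as a failure.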

Dear @generix

Thanks a lot for this answer.
Ok, so there is no way to load the kernel modules without hardware. I will study their error message as you suggest.

I wonder if there is a way to access this kind of hardware in the cloud, without an already configured environment (i.e. no driver installed), so that I can install it myself in order to test. I will try to find that.
