Hi all,
I have a machine with four NVIDIA TURBO RTX 2080 video cards (8GB GDDR6, HDMI/2 DisplayPort/USB Type-C, PCI-Express). They all use the Turing microarchitecture.
I was installing CUDA 10.0 as the new driver for the machine and noticed that all of the shared libraries produced by running the run file for the driver tool (nvidia-smi) are 32-bit instead of 64-bit. I have noticed the same problem with CUDA 10.1.
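For anyone who wants to double-check the 32-bit versus 64-bit observation, here is a minimal sketch that reads the ELF header directly. It relies only on the ELF convention that the fifth byte of the file (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects. The function names are my own, and the paths to inspect are whatever you pass on the command line.

```python
import sys

ELF_MAGIC = b"\x7fELF"
# EI_CLASS values defined by the ELF format: 1 = ELFCLASS32, 2 = ELFCLASS64.
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class_from_header(header):
    """Return '32-bit' or '64-bit' for an ELF header, or None if it is not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(header[4])


def elf_class(path):
    """Read the first five bytes of a file and classify it."""
    with open(path, "rb") as f:
        return elf_class_from_header(f.read(5))


if __name__ == "__main__":
    # Usage: python elf_class.py <path-to-library-or-binary> [...]
    for arg in sys.argv[1:]:
        try:
            result = elf_class(arg) or "not an ELF file"
        except OSError as exc:
            result = f"could not read ({exc})"
        print(f"{arg}: {result}")
```

Running it against both the nvidia-smi binary and the libnvidia-ml shared library would show whether only some of the installed files are 32-bit.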
I do not know whether this is related to the problem I am having. I installed the driver regardless, and whenever I run nvidia-smi I get the following error message: Failed to initialize NVML: Function Not Found
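One thing that might be worth ruling out, though I cannot confirm it is the cause here, is a version mismatch between the loaded NVIDIA kernel module and the user-space libnvidia-ml library. The sketch below is only a rough check under several assumptions: that the loaded module reports its version in /proc/driver/nvidia/version, that the library file name ends in the driver version (libnvidia-ml.so.<version>), and that the libraries live in one of the directories listed in DEFAULT_LIB_DIRS or in LD_LIBRARY_PATH. The directory list and function names are hypothetical and will need adjusting.

```python
import glob
import os
import re

PROC_VERSION = "/proc/driver/nvidia/version"
# Assumed library directories; adjust to wherever the driver libraries live.
DEFAULT_LIB_DIRS = ["/usr/lib64", "/usr/lib/x86_64-linux-gnu", "/usr/lib"]


def kernel_module_version(text):
    """Extract the version number that follows 'Kernel Module' in the proc file."""
    m = re.search(r"Kernel Module.*?\s(\d+\.\d+(?:\.\d+)?)\s", text)
    return m.group(1) if m else None


def library_version(filename):
    """Extract the version suffix from a libnvidia-ml.so.<version> file name."""
    m = re.search(r"libnvidia-ml\.so\.(\d+\.\d+(?:\.\d+)?)$", filename)
    return m.group(1) if m else None


def find_library_versions(dirs):
    """Collect the distinct libnvidia-ml versions found in the given directories."""
    versions = set()
    for d in dirs:
        for path in glob.glob(os.path.join(d, "libnvidia-ml.so.*")):
            version = library_version(path)
            if version:
                versions.add(version)
    return sorted(versions)


if __name__ == "__main__":
    extra = [p for p in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    libs = find_library_versions(extra + DEFAULT_LIB_DIRS)
    try:
        with open(PROC_VERSION) as f:
            kernel = kernel_module_version(f.read())
    except OSError:
        kernel = None  # the file is absent when no nvidia module is loaded
    print("kernel module version:", kernel)
    print("libnvidia-ml versions found:", libs)
    if kernel and libs and kernel not in libs:
        print("mismatch: the loaded module and the user-space library differ")
```

If the two versions differ, or no kernel module version is reported at all, that would at least narrow down where to look.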
Most sources I have consulted tell me to reboot the machine. The problem is that this is a stateless compute node that boots over the network using PXE rather than from a local disk.
So I am wondering if there is a solution for this that does not involve rebooting.
If it does involve rebooting: in my current setup, nvidia-smi and the CUDA toolkits (10.0 and 10.1) are on an NFS mount point that becomes visible after the machine has booted. My bashrc file also sets the paths for both nvidia-smi and the CUDA toolkit.
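To show what the shell actually resolves once the NFS mount is up, a small sketch like the following could be run on the node. It assumes the bashrc settings end up in PATH and LD_LIBRARY_PATH, which may not match the real configuration. The helper name is my own, and nvcc is included only as a stand-in for the CUDA toolkit.

```python
import os
import shutil


def split_search_path(value):
    """Split a PATH-style string into its non-empty entries."""
    return [p for p in value.split(os.pathsep) if p]


if __name__ == "__main__":
    # Which executables the current environment would actually run.
    print("nvidia-smi resolves to:", shutil.which("nvidia-smi"))
    print("nvcc resolves to:", shutil.which("nvcc"))
    # Flag library directories that are listed but not present,
    # for example because the NFS mount is not available yet.
    for d in split_search_path(os.environ.get("LD_LIBRARY_PATH", "")):
        status = "ok     " if os.path.isdir(d) else "missing"
        print(status, d)
```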
Thanks