Hi,
A GPU-enabled cluster I use recently had its NVIDIA driver upgraded from 367.48 to 375.26, and that seems to have broken the driver API. A minimal example is:
#include <stdio.h>
#include <cuda.h>
int main()
{
    CUresult result;
    /* cuInit must be called before any other driver API function. */
    result = cuInit(0);
    printf("Result = %d\n", (int) result);
    return 0;
}
which, compiled with nvcc, gives the output
Result = 999
rather than the “0” (CUDA_SUCCESS) I’d hope for; 999 is CUDA_ERROR_UNKNOWN in cuda.h. The kernel module is present:
$ lsmod | grep nvi
nvidia 11944366 0
i2c_core 40756 7 ast,drm,igb,i2c_i801,drm_kms_helper,i2c_algo_bit,nvidia
and seems to be the right version:
$ dmesg | grep 375.26
[ 11.645234] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 375.26 Thu Dec 8 18:36:43 PST 2016 (using threaded interrupts)
Also, it appears the test executable above is linking to that version:
$ ldd driver-test
linux-vdso.so.1 => (0x00007fff74bc9000)
libcuda.so.1 => /lib64/libcuda.so.1 (0x00007f0da3bbe000)
librt.so.1 => /lib64/librt.so.1 (0x00007f0da39b6000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0da3799000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f0da3595000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f0da328c000)
libm.so.6 => /lib64/libm.so.6 (0x00007f0da2f89000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f0da2d73000)
libc.so.6 => /lib64/libc.so.6 (0x00007f0da29b2000)
libnvidia-fatbinaryloader.so.375.26 => /lib64/libnvidia-fatbinaryloader.so.375.26 (0x00007f0da2765000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0da45e3000)
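To make the version comparison above less manual, here is a rough sketch of the check in C. It only parses text: it pulls the version out of an NVRM banner line like the dmesg one above and compares it with the ".so.<version>" suffix of a library path like those in the ldd output. The function names kernel_module_version and library_matches and the VERSION_CHECK_MAIN macro are made up for this post and are not part of any NVIDIA API. It also assumes the library path is the fully versioned file name, so a symlink such as libcuda.so.1 would have to be resolved first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the token that follows "Kernel Module" in an
 * NVRM banner line into out. Returns 0 on success, -1 if no version is
 * found or it does not fit in n bytes. */
int kernel_module_version(const char *line, char *out, size_t n)
{
    static const char key[] = "Kernel Module";
    const char *p;
    size_t len;

    assert(line != NULL && out != NULL);
    p = strstr(line, key);
    if (p == NULL || n == 0)
        return -1;
    p += sizeof key - 1;
    p += strspn(p, " \t");        /* skip blanks before the version */
    len = strcspn(p, " \t\n");    /* the version ends at the next blank */
    if (len == 0 || len >= n)
        return -1;
    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}

/* Hypothetical helper: return 1 if path ends in ".so.<version>", else 0. */
int library_matches(const char *path, const char *version)
{
    const char *p;

    assert(path != NULL && version != NULL);
    p = strstr(path, ".so.");
    return p != NULL && strcmp(p + 4, version) == 0;
}

#ifdef VERSION_CHECK_MAIN
int main(void)
{
    /* Inputs copied from the dmesg and ldd output in this post. */
    const char *banner = "NVRM: loading NVIDIA UNIX x86_64 Kernel Module "
                         "375.26 Thu Dec 8 18:36:43 PST 2016";
    const char *lib = "/lib64/libnvidia-fatbinaryloader.so.375.26";
    char ver[32];

    if (kernel_module_version(banner, ver, sizeof ver) != 0) {
        fprintf(stderr, "no version found in banner\n");
        return 1;
    }
    printf("kernel module %s, library %s\n", ver,
           library_matches(lib, ver) ? "matches" : "differs");
    return 0;
}
#endif
```

Building with -DVERSION_CHECK_MAIN gives a small standalone program. On this system the two versions agree, so a plain kernel/user-space mismatch does not look like the explanation.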
Any suggestions on how to debug this further?
Thanks,
Josh