Problems with CUDA drivers for NVIDIA Hardware

I have integrated the following NVIDIA devices into one system:

  1. NVIDIA GeForce 9800 GT graphics card
  2. Tesla K40m
  3. Tesla M2050

In the Windows 10 Device Manager I can see that all 3 devices are properly initialized, and the 9800 GT is working as expected as a graphics card (Driver Date: 17/08/2015, Driver Version: 9.18.13.4181). For both Teslas the driver is the same - Driver Date: 31/01/2018, Driver Version: 23.21.13.9085.
I would like to run some of the CUDA Samples v10.1, e.g. 1_Utilities -> deviceQuery and deviceQueryDrv. When I run deviceQuery I get the following printout (an error):


C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\1_Utilities\deviceQuery…/…/bin/win64/Debug/deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\1_Utilities\deviceQuery…/…/bin/win64/Debug/deviceQuery.exe (process 7280) exited with code 1.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .

When I start deviceQueryDrv I get the following printout:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\1_Utilities\deviceQueryDrv…/…/bin/win64/Debug/deviceQueryDrv.exe Starting…

CUDA Device Query (Driver API) statically linked version
Detected 3 CUDA Capable device(s)

Device 0: “Tesla K40m”
CUDA Driver Version: 6.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079398912 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536, 65536) 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): Yes
cuDeviceGetAttribute returned 1
-> CUDA_ERROR_INVALID_VALUE

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\1_Utilities\deviceQueryDrv…/…/bin/win64/Debug/deviceQueryDrv.exe (process 7904) exited with code 0.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .

My question: what do I have to do to make the CUDA samples work? Which drivers should I uninstall/install? This is probably a compatibility issue.
Best regards
Simon

You can only use one driver version for whatever cards you have installed. This is true on both Windows and Linux.

Your 9800 GT is a very old GPU, from about the 2007-2009 timeframe. The last CUDA version that supported that GPU is CUDA 6.5; you cannot use any newer CUDA version as long as the 9800 GT is in your system. It is also why the driver version will be restricted to something like R343 or older, and why you have a 341.81 driver (the 9.18.13.4181 version reported by Device Manager corresponds to 341.81).
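The "CUDA driver version is insufficient for CUDA runtime version" message from deviceQuery is this restriction showing up at runtime: the installed driver reports an older CUDA version than the CUDA 10.1 runtime the sample was built against. If you want to see the two numbers side by side, a minimal sketch like the one below will print them (it is not one of the shipped samples and the file name is just illustrative; build it with nvcc versionCheck.cu -o versionCheck):

// versionCheck.cu - minimal sketch comparing driver and runtime CUDA versions
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;

    cudaDriverGetVersion(&driverVersion);    // CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVersion);  // CUDA runtime version this binary was built against

    printf("Driver supports CUDA : %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("Runtime built as CUDA: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    if (driverVersion < runtimeVersion) {
        printf("Driver is older than the runtime -> cudaGetDeviceCount() fails with\n"
               "cudaErrorInsufficientDriver (35), which is the error deviceQuery reports.\n");
        return 1;
    }
    return 0;
}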

If you would like to use all 3 GPUs (even if you're not using them all for CUDA), you must use CUDA 6.5 or older. If you want to use a newer CUDA version, you will have to physically remove the incompatible GPUs from your system. The last CUDA version that supported the M2050 is CUDA 8.0, so CUDA 10.1 will not work with either the 9800 GT or the M2050.

To use CUDA 10.1 properly, you would have to physically remove the 9800 GT and the Tesla M2050. There are many questions on this forum with answers indicating the same thing.

Also note that the M2050 and K40m are GPUs that are designed to be hosted only in a properly qualified server. Any other usage may result in overheating. You can find many questions/answers discussing that on this forum as well.

Thank you very much for your quick answer. As far as I can see, CUDA 6.5 is only supported up to Windows 8.1. Will it work on Windows 10? Is the Quadro K4000 graphics card supported under CUDA 10.1?
Regarding heat removal from the M2050 and K40m: both cards have been modified with water cooling, so I don't think overheating will be much of a problem here.

CUDA versions after CUDA 8.0 (currently CUDA 9.0, 9.1, 9.2, 10.0, and 10.1) support GPUs of compute capability 3.0 or higher. Your K4000 has a compute capability of 3.0, so it is supported.
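If you want to double-check whatever is installed, the runtime can report each device's compute capability directly; a minimal sketch (again illustrative, not one of the shipped samples; build with nvcc ccCheck.cu -o ccCheck):

// ccCheck.cu - minimal sketch listing the compute capability of each device
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d%s\n",
               i, prop.name, prop.major, prop.minor,
               (prop.major >= 3) ? "" : "  <-- below the 3.0 minimum for CUDA 9.x/10.x");
    }
    return 0;
}

Note that this has to be built and run against a CUDA version whose driver requirement your system already meets; otherwise cudaGetDeviceCount() fails the same way deviceQuery does.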

The support matrix for CUDA 6.5 is listed in the CUDA 6.5 (Windows) installation guide:

http://developer.download.nvidia.com/compute/cuda/6_5/rel/docs/CUDA_Getting_Started_Windows.pdf

I don't see any support listed for Windows 10. I have no idea whether it will work on Windows 10.

Hello Robert
Thank you very much for your support again. I have installed CUDA 6.5 and MS Visual Studio 2013 on my Windows 10 machine without any problems. I ran the deviceQueryDrv project and got this printout:


C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5\1_Utilities\deviceQueryDrv…/…/bin/win32/Debug/deviceQueryDrv.exe Starting…

CUDA Device Query (Driver API) statically linked version
Detected 3 CUDA Capable device(s)

Device 0: “Tesla K40m”
CUDA Driver Version: 6.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 4096 MBytes (4294967295 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536, 65536) 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “GeForce 9800 GT”
CUDA Driver Version: 6.5
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
(14) Multiprocessors, ( 8) CUDA Cores/MP: 112 CUDA Cores
GPU Clock rate: 1500 MHz (1.50 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Sizes 1D=(8192) 2D=(65536, 32768) 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Texture alignment: 256 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: “Tesla M2050”
CUDA Driver Version: 6.5
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818244608 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1546 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536, 65535) 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 10 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer-to-Peer (P2P) access from Tesla K40m (GPU0) -> Tesla M2050 (GPU2) : No
Peer-to-Peer (P2P) access from Tesla M2050 (GPU2) -> Tesla K40m (GPU0) : No
Result = PASS


I think the CUDA software is running as expected in this configuration. Now I have some further questions:

  1. Can the software be configured so that the PCI-E cards (the Teslas) communicate with each other without intervention from the processor, i.e. over a PCI-E switch? Is this functionality called Peer-to-Peer?
  2. Can the firmware of a Tesla card somehow be changed (adapted to special needs)? Is this process explained somewhere? Is any documentation available about the machine code for the different Tesla GPUs? And how is this connected to the PCI-E interface?

  1. Yes, depending on the system design. Yes, it is referred to as Peer-to-Peer. You can find all sorts of resources on the web discussing Peer-to-Peer; a short sketch of the basic pattern follows after item 2.

  2. The firmware of GPUs cannot be changed by the end user. CUDA documentation is available at docs.nvidia.com
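
For question 1, the basic Peer-to-Peer pattern at the CUDA runtime level looks like the sketch below (illustrative, not one of the shipped samples; the simpleP2P sample that ships with the CUDA samples exercises the same idea). You ask whether two devices can address each other, enable access in both directions, and then copy directly between them. Whether cudaDeviceCanAccessPeer() reports "yes" depends on the GPUs, the driver mode (TCC is required on Windows, which your Teslas already use) and the PCI-E topology; your deviceQueryDrv output above already shows "No" between the K40m and the M2050, so with this particular pair the copies would still be staged through host memory.

// p2pCheck.cu - minimal sketch of the Peer-to-Peer check/enable/copy pattern
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Device ordinals of the two Teslas as enumerated in the deviceQueryDrv
    // output above (0 = K40m, 2 = M2050); adjust if the enumeration changes.
    const int devA = 0, devB = 2;

    int canAB = 0, canBA = 0;
    cudaDeviceCanAccessPeer(&canAB, devA, devB);   // can devA reach devB directly?
    cudaDeviceCanAccessPeer(&canBA, devB, devA);   // and the other direction?
    printf("P2P %d->%d: %s, %d->%d: %s\n", devA, devB, canAB ? "yes" : "no",
                                           devB, devA, canBA ? "yes" : "no");
    if (!canAB || !canBA)
        return 0;   // no P2P path; copies are staged through host memory instead

    const size_t bytes = 1 << 20;   // 1 MiB test buffer
    void *bufA = NULL, *bufB = NULL;

    cudaSetDevice(devA);
    cudaDeviceEnablePeerAccess(devB, 0);   // second argument is a reserved flag, must be 0
    cudaMalloc(&bufA, bytes);

    cudaSetDevice(devB);
    cudaDeviceEnablePeerAccess(devA, 0);
    cudaMalloc(&bufB, bytes);

    // Direct device-to-device copy; with P2P enabled this does not bounce through the host.
    cudaMemcpyPeer(bufB, devB, bufA, devA, bytes);
    cudaDeviceSynchronize();
    printf("cudaMemcpyPeer finished: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(bufB);
    cudaSetDevice(devA);
    cudaFree(bufA);
    return 0;
}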