Problems with CUDA drivers for NVIDIA Hardware

I have integrated the following NVIDIA devices into one system:

  1. NVIDIA GeForce 9800 GT graphics card
  2. Tesla K40m
  3. Tesla M2050

In the Windows 10 Device Manager I can see that all 3 devices are properly initialized, and the 9800 GT is working as expected as a graphics card (Driver Date: 17/08/2015, Driver Version: 9.18.13.4181). For both Teslas the driver is the same - Driver Date: 31/01/2018, Driver Version: 23.21.13.9085.
I would like to run some of the CUDA Samples v10.1, e.g. 1_Utilities -> deviceQuery and deviceQueryDrv. When I run deviceQuery I get the following printout (an error):


C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\1_Utilities\deviceQuery…/…/bin/win64/Debug/deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\1_Utilities\deviceQuery…/…/bin/win64/Debug/deviceQuery.exe (process 7280) exited with code 1.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .

When I start deviceQueryDrv I get the following printout:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\1_Utilities\deviceQueryDrv…/…/bin/win64/Debug/deviceQueryDrv.exe Starting…

CUDA Device Query (Driver API) statically linked version
Detected 3 CUDA Capable device(s)

Device 0: “Tesla K40m”
CUDA Driver Version: 6.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079398912 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Max Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536, 65536) 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): Yes
cuDeviceGetAttribute returned 1
-> CUDA_ERROR_INVALID_VALUE

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\1_Utilities\deviceQueryDrv…/…/bin/win64/Debug/deviceQueryDrv.exe (process 7904) exited with code 0.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .

My question: what do I have to do to make the CUDA samples work? Which drivers should I uninstall/install? This is probably a compatibility issue.
Best regards
Simon

You can only use one driver version for whatever cards you have installed. This is true on both Windows and Linux.

Your 9800 GT is a very old GPU, from about the 2007-2009 timeframe. The last CUDA version that supported that GPU is CUDA 6.5; you cannot use any newer CUDA version as long as the 9800 GT is in your system. It is also why the driver version will be restricted to something like R343 or older, and why you have a 341.81 driver (the 9.18.13.4181 version reported by Device Manager corresponds to 341.81).
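The "CUDA driver version is insufficient for CUDA runtime version" message from deviceQuery is this restriction showing up at runtime: the installed driver reports an older CUDA version than the CUDA 10.1 runtime the sample was built against. If you want to see the two numbers side by side, a minimal sketch like the one below will print them (it is not one of the shipped samples and the file name is just illustrative; build it with nvcc versionCheck.cu -o versionCheck):

// versionCheck.cu - minimal sketch comparing driver and runtime CUDA versions
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;

    cudaDriverGetVersion(&driverVersion);    // CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVersion);  // CUDA runtime version this binary was built against

    printf("Driver supports CUDA : %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("Runtime built as CUDA: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    if (driverVersion < runtimeVersion) {
        printf("Driver is older than the runtime -> cudaGetDeviceCount() fails with\n"
               "cudaErrorInsufficientDriver (35), which is the error deviceQuery reports.\n");
        return 1;
    }
    return 0;
}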

If you would like to use all 3 GPUs (even if you're not using them all for CUDA), you must use CUDA 6.5 or older. If you want to use a newer CUDA version, you will have to physically remove the incompatible GPUs from your system. The last CUDA version that supported the M2050 is CUDA 8.0, so CUDA 10.1 will not work with either the 9800 GT or the M2050.

To use CUDA 10.1 properly, you would have to physically remove the 9800 GT and the Tesla M2050. There are many questions on this forum with answers indicating the same thing.

Also note that the M2050 and K40m are GPUs that are designed to be hosted only in a properly qualified server. Any other usage may result in overheating. You can find many questions/answers discussing that on this forum as well.

Thank you very much for your quick answer. As far as I can see, CUDA 6.5 is only supported up to Windows 8.1. Will it work on Windows 10? Is the Quadro K4000 graphics card supported under CUDA 10.1?
Regarding heat removal from the M2050 and K40m: both cards have been modified with water cooling, so I don't think overheating will be much of a problem here.

CUDA versions after CUDA 8.0 (currently CUDA 9.0, 9.1, 9.2, 10.0, and 10.1) support GPUs of compute capability 3.0 or higher. Your K4000 has a compute capability of 3.0, so it is supported.
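If you want to double-check whatever is installed, the runtime can report each device's compute capability directly; a minimal sketch (again illustrative, not one of the shipped samples; build with nvcc ccCheck.cu -o ccCheck):

// ccCheck.cu - minimal sketch listing the compute capability of each device
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("cudaGetDeviceCount failed\n");
        return 1;
    }

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d%s\n",
               i, prop.name, prop.major, prop.minor,
               (prop.major >= 3) ? "" : "  <-- below the 3.0 minimum for CUDA 9.x/10.x");
    }
    return 0;
}

Note that this has to be built and run against a CUDA version whose driver requirement your system already meets; otherwise cudaGetDeviceCount() fails the same way deviceQuery does.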

The support matrix for CUDA 6.5 is listed in the CUDA 6.5 (Windows) installation guide:

http://developer.download.nvidia.com/compute/cuda/6_5/rel/docs/CUDA_Getting_Started_Windows.pdf

I don't see any support listed for Windows 10. I have no idea whether it will work on Windows 10.

Hello Robert
Thank you very much for your support again. I have installed CUDA 6.5 and MS Visual Studio 2013 on my Windows 10 machine without any problems. I ran the deviceQueryDrv project and got this printout:


C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5\1_Utilities\deviceQueryDrv…/…/bin/win32/Debug/deviceQueryDrv.exe Starting…

CUDA Device Query (Driver API) statically linked version
Detected 3 CUDA Capable device(s)

Device 0: “Tesla K40m”
CUDA Driver Version: 6.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 4096 MBytes (4294967295 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
GPU Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536, 65536) 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 5 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “GeForce 9800 GT”
CUDA Driver Version: 6.5
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
(14) Multiprocessors, ( 8) CUDA Cores/MP: 112 CUDA Cores
GPU Clock rate: 1500 MHz (1.50 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Sizes 1D=(8192) 2D=(65536, 32768) 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Texture alignment: 256 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: “Tesla M2050”
CUDA Driver Version: 6.5
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818244608 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1546 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536, 65535) 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
CUDA Device Driver Mode (TCC or WDDM): TCC (Tesla Compute Cluster Driver)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 10 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer-to-Peer (P2P) access from Tesla K40m (GPU0) -> Tesla M2050 (GPU2) : No
Peer-to-Peer (P2P) access from Tesla M2050 (GPU2) -> Tesla K40m (GPU0) : No
Result = PASS


I think the CUDA software is running as expected in this configuration. Now I have some further questions:

  1. Can the software be configured so that the PCI-E cards (the Teslas) communicate with each other without intervention from the processor, i.e. over a PCI-E switch? Is this functionality called Peer-to-Peer?
  2. Can the firmware of a Tesla card somehow be changed (adapted to special needs)? Is this process explained somewhere? Is any documentation available about the machine code for the different Tesla GPUs? And how is this connected to the PCI-E interface?

  1. Yes, depending on the system design. Yes, it is referred to as Peer-to-Peer. You can find all sorts of resources on the web discussing Peer-to-Peer; a short sketch of the basic pattern follows after item 2.

  2. The firmware of GPUs cannot be changed by the end user. CUDA documentation is available at docs.nvidia.com
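
For question 1, the basic Peer-to-Peer pattern at the CUDA runtime level looks like the sketch below (illustrative, not one of the shipped samples; the simpleP2P sample that ships with the CUDA samples exercises the same idea). You ask whether two devices can address each other, enable access in both directions, and then copy directly between them. Whether cudaDeviceCanAccessPeer() reports "yes" depends on the GPUs, the driver mode (TCC is required on Windows, which your Teslas already use) and the PCI-E topology; your deviceQueryDrv output above already shows "No" between the K40m and the M2050, so with this particular pair the copies would still be staged through host memory.

// p2pCheck.cu - minimal sketch of the Peer-to-Peer check/enable/copy pattern
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Device ordinals of the two Teslas as enumerated in the deviceQueryDrv
    // output above (0 = K40m, 2 = M2050); adjust if the enumeration changes.
    const int devA = 0, devB = 2;

    int canAB = 0, canBA = 0;
    cudaDeviceCanAccessPeer(&canAB, devA, devB);   // can devA reach devB directly?
    cudaDeviceCanAccessPeer(&canBA, devB, devA);   // and the other direction?
    printf("P2P %d->%d: %s, %d->%d: %s\n", devA, devB, canAB ? "yes" : "no",
                                           devB, devA, canBA ? "yes" : "no");
    if (!canAB || !canBA)
        return 0;   // no P2P path; copies are staged through host memory instead

    const size_t bytes = 1 << 20;   // 1 MiB test buffer
    void *bufA = NULL, *bufB = NULL;

    cudaSetDevice(devA);
    cudaDeviceEnablePeerAccess(devB, 0);   // second argument is a reserved flag, must be 0
    cudaMalloc(&bufA, bytes);

    cudaSetDevice(devB);
    cudaDeviceEnablePeerAccess(devA, 0);
    cudaMalloc(&bufB, bytes);

    // Direct device-to-device copy; with P2P enabled this does not bounce through the host.
    cudaMemcpyPeer(bufB, devB, bufA, devA, bytes);
    cudaDeviceSynchronize();
    printf("cudaMemcpyPeer finished: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(bufB);
    cudaSetDevice(devA);
    cudaFree(bufA);
    return 0;
}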