cudaGetDeviceCount not detecting multiple cards

calvar · September 7, 2015, 3:14pm

Good day,

I have a workstation running Ubuntu Linux 14.04 with two NVIDIA cards:

-Quadro NVS 310
-Tesla K20

If I type:

lspci | grep NVIDIA

I get

03:00.0 VGA compatible controller: NVIDIA Corporation GF119 [NVS 310] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GF119 HDMI Audio Controller (rev a1)
04:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)

So I know both cards are recognized. However, if I use the

cudaGetDeviceCount

function, I get just one card listed, and

cudaGetDeviceProperties

shows only the Quadro card:

Device Number: 0
  Device name: NVS 310
  Memory Clock Rate (KHz): 875000
  Memory Bus Width (bits): 64
  Peak Memory Bandwidth (GB/s): 14.000000

The code I use is:

  int nDevices = 0;

  // Number of devices the CUDA runtime can use (may be fewer than lspci lists).
  cudaGetDeviceCount(&nDevices);
  cout << nDevices << " cards.\n";
  for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("  Device name: %s\n", prop.name);
    printf("  Memory Clock Rate (KHz): %d\n",
           prop.memoryClockRate);
    printf("  Memory Bus Width (bits): %d\n",
           prop.memoryBusWidth);
    // Peak bandwidth: 2 transfers per clock * clock (kHz) * bus width (bytes) / 1e6.
    printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
           2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
  }

Can someone help me with this?

Thank you :)
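
For reference, the bandwidth figure printed for the NVS 310 follows directly from the formula in the last printf of the posted code. Below is a minimal sketch of that arithmetic as a standalone function. The name peakBandwidthGBs is hypothetical, and the factor of 2 assumes double-data-rate memory, as the posted formula does.

```cpp
#include <cassert>

// Theoretical peak memory bandwidth in GB/s.
// memoryClockKHz:     memory clock in kHz (cudaDeviceProp::memoryClockRate)
// memoryBusWidthBits: bus width in bits   (cudaDeviceProp::memoryBusWidth)
// Assumption: two transfers per clock, as in the posted formula.
double peakBandwidthGBs(int memoryClockKHz, int memoryBusWidthBits) {
  assert(memoryClockKHz >= 0 && memoryBusWidthBits >= 0);
  const double bytesPerTransfer = memoryBusWidthBits / 8.0;
  return 2.0 * memoryClockKHz * bytesPerTransfer / 1.0e6;
}
```

With the values reported above (875000 kHz, 64-bit bus) this gives 2 * 875000 * 8 / 1e6 = 14 GB/s, matching the printed output. The sketch divides by 8.0 rather than 8, so a bus width that is not a multiple of 8 bits would not be truncated as it is in the posted integer division.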

Robert_Crovella · September 7, 2015, 3:20pm

What is the output of:

nvidia-smi

and:

sudo dmesg |grep NVRM

?

calvar · September 7, 2015, 3:29pm

Hi, and thank you for the reply.

nvidia-smi:

+------------------------------------------------------+                       
| NVIDIA-SMI 346.82     Driver Version: 346.82         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVS 310             Off  | 0000:03:00.0     N/A |                  N/A |
| 30%   52C    P8    N/A /  N/A |    228MiB /   511MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0              C   Not Supported                                         |
+-----------------------------------------------------------------------------+

dmesg |grep NVRM:

[   17.159543] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  346.82  Wed Jun 17 10:37:46 PDT 2015
[   22.363096] NVRM: RmInitAdapter failed! (0x25:0x1c:1270)
[   22.363106] NVRM: rm_init_adapter failed for device bearing minor number 1
[   22.363130] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[  472.740284] NVRM: RmInitAdapter failed! (0x25:0x1c:1270)
[  472.740297] NVRM: rm_init_adapter failed for device bearing minor number 1
[  472.740327] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[14278.115520] NVRM: RmInitAdapter failed! (0x25:0x40:1270)
[14278.115542] NVRM: rm_init_adapter failed for device bearing minor number 1
[14278.115572] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[17662.200736] NVRM: RmInitAdapter failed! (0x25:0x40:1270)
[17662.200755] NVRM: rm_init_adapter failed for device bearing minor number 1
[17662.200788] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[17970.566928] NVRM: RmInitAdapter failed! (0x25:0x40:1270)
[17970.566949] NVRM: rm_init_adapter failed for device bearing minor number 1
[17970.566980] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
[20196.492173] NVRM: RmInitAdapter failed! (0x25:0x40:1270)
[20196.492192] NVRM: rm_init_adapter failed for device bearing minor number 1
[20196.492224] NVRM: nvidia_frontend_open: minor 1, module->open() failed, error -5
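
The repeated NVRM messages above all name the same device. As a rough sketch (plain C++, no CUDA required), the failing minor numbers could be pulled out of saved dmesg text as follows. failedMinors is a hypothetical helper, and the message wording is copied from the log above, so other driver versions may phrase it differently.

```cpp
#include <set>
#include <sstream>
#include <string>

// Collect the device minor numbers named in lines of the form
// "... rm_init_adapter failed for device bearing minor number N".
std::set<int> failedMinors(const std::string& dmesgText) {
  const std::string marker =
      "rm_init_adapter failed for device bearing minor number ";
  std::set<int> minors;
  std::istringstream lines(dmesgText);
  std::string line;
  while (std::getline(lines, line)) {
    const std::size_t pos = line.find(marker);
    if (pos == std::string::npos) continue;  // not a failure line
    // Parse the integer that follows the marker text.
    std::istringstream number(line.substr(pos + marker.size()));
    int minor = 0;
    if (number >> minor) minors.insert(minor);
  }
  return minors;
}
```

Given the log above, the result contains only the value 1, so every failure involves the same device. Presumably that is the K20c, since nvidia-smi still lists the NVS 310 as GPU 0, but the log alone does not confirm it.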

Robert_Crovella · September 7, 2015, 3:40pm

Which CUDA version did you install?
Did you follow the instructions in the Linux Getting Started Guide?
Which method (package manager or runfile installer) did you use to install CUDA?

At any time did you switch methods between package manager and installer?

Did you install the driver separately from installing CUDA?

Did you remove the nouveau driver?

calvar · September 7, 2015, 3:54pm

Originally I shut down lightdm, then I removed (purged) all nvidia-* packages and installed the CUDA 7.0 toolkit,
and finally added

export CUDA_HOME=/usr/local/cuda-7.0
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64

PATH=${CUDA_HOME}/bin:${PATH}
export PATH

to the .bashrc file

Robert_Crovella · September 7, 2015, 4:04pm

If you want my help, please answer my questions.

calvar · September 7, 2015, 4:23pm

Hi,

version 7.0

Yes

local package installer (dpkg -i)

No.

No

Robert_Crovella · September 7, 2015, 4:31pm

What is the output from:

sudo lspci -vvv |grep -A 20 -i nvidia

and

sudo lsmod |grep -i nvidia

and

sudo lsmod |grep -i nouveau

?

calvar · September 7, 2015, 4:43pm

sudo lspci -vvv |grep -A 20 -i nvidia:

03:00.0 VGA compatible controller: NVIDIA Corporation GF119 [NVS 310] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 094e
	Physical Slot: 2
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 86
	Region 0: Memory at f8000000 (32-bit, non-prefetchable) 
	Region 1: Memory at e8000000 (64-bit, prefetchable) 
	Region 3: Memory at f0000000 (64-bit, prefetchable) 
	Region 5: I/O ports at d000 
	[virtual] Expansion ROM at f9000000 [disabled] 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee00718  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
--
	Kernel driver in use: nvidia

03:00.1 Audio device: NVIDIA Corporation GF119 HDMI Audio Controller (rev a1)
	Subsystem: NVIDIA Corporation Device 094e
	Physical Slot: 2
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin B routed to IRQ 36
	Region 0: Memory at f9080000 (32-bit, non-prefetchable) 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <4us
			ClockPM+ Surprise- LLActRep- BwNot-
--
04:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20c] (rev a1)
	Subsystem: NVIDIA Corporation Device 0982
	Physical Slot: 4
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 40
	Region 0: Memory at fa000000 (32-bit, non-prefetchable) 
	Region 1: Memory at d0000000 (64-bit, prefetchable) 
	Region 3: Memory at e0000000 (64-bit, prefetchable) 
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 00000000fee00738  Data: 0000
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
--
	Kernel driver in use: nvidia

06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
	Subsystem: Dell Device 0619
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 19
	Region 0: Memory at fb500000 (32-bit, non-prefetchable) 
	Region 2: I/O ports at c000 
	Region 3: Memory at fb580000 (32-bit, non-prefetchable) 
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000
	Capabilities: [a0] Express (v2) Endpoint, MSI 00

sudo lsmod |grep -i nvidia:

nvidia               8379750  43 
drm                   303102  2 nvidia

sudo lsmod |grep -i nouveau: Nothing appears.
However, the following packages are shown as installed in the Synaptic package manager:

xserver-xorg-video-nouveau-lts-quantal
xserver-xorg-video-nouveau
libdrm-nouveau2

Robert_Crovella · September 7, 2015, 5:02pm

Well, I don’t see any problems except the obvious ones reported by nvidia-smi and dmesg: the driver is not happy with the K20c.

It's remotely possible that it might be a defective motherboard or GPU at this point. Or something else that I am unaware of. But my guess right now is a broken driver install.

If it were my system, I would start with a clean install of Ubuntu 14.04 and install only the driver, using the runfile installer method. Effectively this means downloading and using a driver installer from nvidia.com (not apt). If that does not work for some reason, the driver installer log may provide some clues. If it does work, then you could move on to installing the CUDA toolkit via the runfile installer method, or start over with the package manager method and see whether any mistakes were made along the way, or analyze more closely what is going on and what is getting installed.

The lsmod output should have shown both an nvidia entry and an nvidia_uvm entry. So I suspect a broken driver install at this point, although if you followed the instructions from the Linux Getting Started Guide, you should have gotten all the driver pieces. Trying to fix this using a package manager method would effectively mean removing or purging all the existing nvidia packages, and basically repeating the process that you already did. If you possibly made a mistake somewhere in that process, repeating the same process is not likely to produce a different result, which is why I suggest installing the driver via runfile installer.

If you decide to install the driver via runfile installer, be sure to follow the steps for removal of the nouveau driver. That is not just blacklisting, but also removal from the initrd image. These steps should work (as root):

echo -e "blacklist nouveau\noptions nouveau modeset=0"  > /etc/modprobe.d/disable-nouveau.conf
update-initramfs -u
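
The module check described above (an nvidia entry plus an nvidia_uvm entry, and no nouveau entry) can also be done programmatically on saved lsmod output. The following is only a sketch under that assumption. moduleLoaded is a hypothetical helper, and it matches the first column of each line so that a module named only in the "used by" column is not counted.

```cpp
#include <sstream>
#include <string>

// Return true if `module` is the first column of any line of lsmod output.
bool moduleLoaded(const std::string& lsmodText, const std::string& module) {
  std::istringstream lines(lsmodText);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string name;
    // The first whitespace-separated field is the module name.
    if (fields >> name && name == module) return true;
  }
  return false;
}
```

Applied to the lsmod output posted earlier in the thread, moduleLoaded reports nvidia as present and both nvidia_uvm and nouveau as absent, which is the combination that prompted the suspicion of an incomplete driver install.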

calvar · September 7, 2015, 5:18pm

OK, I’ll try the runfile installation. Thanks.

calvar · September 8, 2015, 1:43am

OK. So I did the .run installation and nvidia-smi told me that the card was underpowered. I opened the case and there it was: for some reason the person who assembled the workstation had not connected both power cables to the K20. Now nvidia-smi detects both cards.

Thank you for the help.
