Not all CUDA devices detected in CUDA Fortran on Windows 10

I have been trying to get back to compiling codes with pgf90 after upgrading from Windows 7 to Windows 10 and to pgfortran 17. After the upgrades, single-GPU versions of my CUDA Fortran codes compile and run correctly, but my multi-GPU versions do not. CUDA only detects two of the three CUDA devices installed in my machine. Prior to the upgrades I had multi-GPU pgf90 codes running and would really like to get back to that point.

I have three CUDA devices installed: an NVIDIA Quadro 600, which I use for graphics, and two Tesla C2075s. All three devices show up in the system's Device Manager and all report as working properly. Prior to the upgrades these devices appeared as devices 1, 0, and 2, respectively, in CUDA programs. That is, one C2075 showed up as device 0 and the other as device 2. Now device 2 is not detected. By "not detected", I mean that cudaGetDeviceCount(ndevice) returns 2 rather than 3 as before. When I check the device properties, device 0 is the first C2075 as before, the Quadro is device 1, as before, and there is no device 2. I have tried cudaSetDevice(2) to see whether the device is there but simply not being detected; it returns error code 10. I have re-installed pgf90 several times with different installation options, with no change.
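For reference, the check I am running is essentially the following (a minimal sketch, not my actual code; the variable names are just for illustration, built with pgf90 -Mcuda):

program count_devices
  use cudafor
  implicit none
  integer :: ndevice, idev, istat
  type(cudaDeviceProp) :: prop

  ! How many devices does the runtime report?
  istat = cudaGetDeviceCount(ndevice)
  print *, 'cudaGetDeviceCount status =', istat, '  ndevice =', ndevice

  ! List the devices it does see
  do idev = 0, ndevice - 1
    istat = cudaGetDeviceProperties(prop, idev)
    print *, 'device', idev, ': ', trim(prop%name)
  end do

  ! Try to select the third device explicitly; on my machine this
  ! now fails with error code 10 instead of returning 0 (success).
  istat = cudaSetDevice(2)
  print *, 'cudaSetDevice(2) status =', istat
end program count_devices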

My machine has an ASUS Sabertooth X58 motherboard with three PCIe x16 slots, a 6-core i7, and 24 GB of RAM. The BIOS has only one PCI setting, which switches between plug-and-play and non-plug-and-play modes. I have tried both settings.

Does anyone have any suggestions?


What is the output of

pgaccelinfo

That utility should report all the GPUs we can detect.
If it is not showing all of them, I would first make sure the
CUDA drivers have been reinstalled since the upgrade to Windows 10.
The drivers come from NVIDIA, and with them you should be able
to install and verify the GPUs present using NVIDIA's own software.

Then run
pgaccelinfo
which calls the same Nvidia routines you refer to.

PGI compilers come after the hardware works - PGI does not diagnose GPU hardware issues.

Thanks for your input. The response to pgaccelinfo is, as you would expect:


CUDA Driver Version: 8000

Device Number: 0
Device Name: Tesla C2075
Device Revision Number: 2.0
Global Memory Size: 5574492160
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1566 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Managed Memory: No
PGI Compiler Option: -ta=tesla:cc20

Device Number: 1
Device Name: Quadro 600
Device Revision Number: 2.1
Global Memory Size: 1073741824
Number of Multiprocessors: 2
Number of Cores: 64
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1280 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 800 MHz
Memory Bus Width: 128 bits
L2 Cache Size: 131072 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: Yes
Managed Memory: No
PGI Compiler Option: -ta=tesla:cc20

So all the PGI software agrees: under Windows 10 and PGI 17, it can only see two of the three installed GPUs, while Windows 10 itself sees three working GPUs. Just a few weeks ago, Windows 7 and the older version of PGI Fortran could see all three and use the two C2075s. I suppose it is possible that the second C2075 failed in the last few weeks, or that the motherboard has developed a problem with that third slot, and that Windows 10 somehow cannot detect either condition. However, it seems far more likely to be a software problem. My guess is that it has something to do with how Windows 10 interacts with the PCI bus compared to Windows 7. I do not think it is a PGI software problem either. I was hoping to find someone else who had run into this problem and, hopefully, had solved it. My next step will be to swap the positions of the two C2075s to see if I am right that the cards are fine. If that works, I will move both C2075s to another computer to see if it is the motherboard.

I failed to mention that the machine on which I am running Windows 10, the one that can only see two of the three installed CUDA devices, is set up as a dual-boot machine with Ubuntu 14.04. I have a license for pgf90 for Linux as well as Windows, but had not tried the Linux version. My plan was to develop code under MS Visual Studio/pgf90 on the Windows side and then re-compile under Linux to run compute jobs on Linux compute servers.

Because I was stuck on getting multi-GPU to work under Windows, I gave the Linux version of pgf90 a try. Everything works fine under Linux: CUDA sees all three devices and compiles multi-GPU applications without a problem. So now I have ruled out hardware as the problem on the Windows side. I have the latest NVIDIA drivers installed for both operating systems and the latest updates on both operating systems. I have also tried all the PCI settings available in the machine's BIOS. My conclusion is that Windows 10 is doing something with the PCI bus, different from both Ubuntu 14.04 and Windows 7, that causes the third CUDA device to be unusable. The question is whether this is a fundamental limitation of Windows 10 or a settings issue. I will leave that for further research. For now my plan is to develop under Ubuntu. This means I am out about $1000 for licenses for PGI and MS Visual Studio 2015 that I am not able to use. Live and learn.

Problem solved - sort of.

Engineering suggests a driver issue. Please send the output
of

nvidia-smi.exe

which would typically be found at


C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe

Since everything is of compute capability 2.0/2.1,
we don't think the problem is caused by mixing newer-model
GPUs with older ones.

dave

Thanks for your continued help. Even if this problem is not solved, I am good to go using the Linux version, and I was thinking about going that way eventually anyway.

The output of nvidia-smi is as follows:

Wed May 24 12:12:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 377.35                 Driver Version: 377.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 600          WDDM | 0000:03:00.0      On |                  N/A |
| 30%   50C    P12    N/A / N/A |    222MiB /  1024MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla C2075          TCC | 0000:04:00.0     Off |                    0 |
| 30%   56C    P12    32W / N/A |      0MiB /  5316MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla C2075          TCC | 0000:05:00.0     Off |                    0 |
| 30%   49C    P12    28W / N/A |      0MiB /  5316MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I got the 377.35 driver from Nvidia’s website last week.


The strange thing is that this command detects both C2075s. Also, the order in which it lists them is different from the device numbering in CUDA, with the Quadro as 0, followed by the two C2075s. This, by the way, is the order in which they are installed in the slots.
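To make the two numberings easier to compare, here is a minimal CUDA Fortran sketch (assuming the pciDomainID, pciBusID, and pciDeviceID fields of the cudaDeviceProp type in the cudafor module) that prints each CUDA device alongside its PCI Bus-Id in the same form nvidia-smi uses:

program map_devices
  use cudafor
  implicit none
  integer :: ndevice, idev, istat
  type(cudaDeviceProp) :: prop

  istat = cudaGetDeviceCount(ndevice)
  do idev = 0, ndevice - 1
    istat = cudaGetDeviceProperties(prop, idev)
    ! Print "domain:bus:device.function" so it lines up with the
    ! Bus-Id column of nvidia-smi, e.g. 0000:04:00.0
    write (*, '(a,i0,2a)', advance='no') 'CUDA device ', idev, ': ', trim(prop%name)
    write (*, '(a,z4.4,a,z2.2,a,z2.2,a)') '  Bus-Id ', prop%pciDomainID, ':', &
          prop%pciBusID, ':', prop%pciDeviceID, '.0'
  end do
end program map_devices

On my machine I would expect the two C2075s to show buses 04 and 05, matching the nvidia-smi listing above.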

The different numbering of the GPUs is not a puzzle,
as the driver and pgaccelinfo enumerate them in a different order.

But the Windows OpenACC not finding the third GPU is a puzzle,
and we are looking at it.

dave

Let me know if I can help with more information, tests, etc. Count me in.

Send the outputs of

pgaccelinfo -dev 0
pgaccelinfo -dev 1
pgaccelinfo -dev 2

on Windows and Linux.

You may also want to send trs@pgroup.com the following outputs

On Linux
/sbin/ifconfig

on Windows
ipconfig /all

which may show some differences.

On Windows pgaccelinfo -dev 0 produces:

CUDA Driver Version: 8000

Device Number: 0
Device Name: Tesla C2075
Device Revision Number: 2.0
Global Memory Size: 5574492160
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1566 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Managed Memory: No
PGI Compiler Option: -ta=tesla:cc20

On Linux: pgaccelinfo -dev 0 produces:

CUDA Driver Version: 8000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017

Device Number: 0
Device Name: Tesla C2075
Device Revision Number: 2.0
Global Memory Size: 5558763520
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1566 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Managed Memory: No
PGI Compiler Option: -ta=tesla:cc20

On Windows pgaccelinfo -dev 1 produces:

CUDA Driver Version: 8000

Device Number: 1
Device Name: Quadro 600
Device Revision Number: 2.1
Global Memory Size: 1073741824
Number of Multiprocessors: 2
Number of Cores: 64
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1280 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 800 MHz
Memory Bus Width: 128 bits
L2 Cache Size: 131072 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: Yes
Managed Memory: No
PGI Compiler Option: -ta=tesla:cc20

On Linux pgaccelinfo -dev 1 produces:


CUDA Driver Version: 8000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017

Device Number: 1
Device Name: Quadro 600
Device Revision Number: 2.1
Global Memory Size: 1010958336
Number of Multiprocessors: 2
Number of Cores: 64
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1280 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 800 MHz
Memory Bus Width: 128 bits
L2 Cache Size: 131072 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: Yes
Managed Memory: No
PGI Compiler Option: -ta=tesla:cc20

On Windows pgaccelinfo -dev 2 produces:

CUDA Driver Version: 8000
Device Number: 2
could not attach to this device


On Linux pgaccelinfo -dev 2 produced:

CUDA Driver Version: 8000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017
Device Number: 2
Device Name: Tesla C2075
Device Revision Number: 2.0
Global Memory Size: 5558763520
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1566 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Managed Memory: No
PGI Compiler Option: -ta=tesla:cc20

On Windows ipconfig /all produces:

Windows IP Configuration

Host Name . . . . . . . . . . . . : BigBoy
Primary Dns Suffix . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No

Wireless LAN adapter Local Area Connection* 2:

Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Microsoft Wi-Fi Direct Virtual Adapter
Physical Address. . . . . . . . . : 56-A0-50-70-AD-B5
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes

Wireless LAN adapter Wireless Network Connection 3:

Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : The ASUS 802.11 Network Adapter provides wireless local area networking.
Physical Address. . . . . . . . . : 54-A0-50-70-AD-B5
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
IPv6 Address. . . . . . . . . . . : ::e133:1c0e:2e40:b306(Preferred)
Temporary IPv6 Address. . . . . . : ::9024:a600:c29f:ab29(Preferred)
Link-local IPv6 Address . . . . . : fe80::e133:1c0e:2e40:b306%11(Preferred)
IPv4 Address. . . . . . . . . . . : 192.168.0.11(Preferred)
Subnet Mask . . . . . . . . . . . : 255.255.255.0
Lease Obtained. . . . . . . . . . : Friday, May 26, 2017 7:31:51 AM
Lease Expires . . . . . . . . . . : Friday, May 26, 2017 8:32:15 AM
Default Gateway . . . . . . . . . : 192.168.0.1
DHCP Server . . . . . . . . . . . : 192.168.0.1
DHCPv6 IAID . . . . . . . . . . . : 475308112
DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-1D-E3-DB-5E-BC-AE-C5-56-41-9B
DNS Servers . . . . . . . . . . . : 216.82.201.11
66.90.130.10
NetBIOS over Tcpip. . . . . . . . : Enabled

Tunnel adapter isatap.{612DC874-754E-4D3F-9F63-CEBD30BBCD38}:

Media State . . . . . . . . . . . : Media disconnected
Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Microsoft ISATAP Adapter
Physical Address. . . . . . . . . : 00-00-00-00-00-00-00-E0
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes

Tunnel adapter Local Area Connection* 11:

Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Microsoft Teredo Tunneling Adapter
Physical Address. . . . . . . . . : 00-00-00-00-00-00-00-E0
DHCP Enabled. . . . . . . . . . . : No
Autoconfiguration Enabled . . . . : Yes
IPv6 Address. . . . . . . . . . . : 2001:0:4137:9e76:ce7:1f1f:bda5:34ec(Preferred)
Link-local IPv6 Address . . . . . : fe80::ce7:1f1f:bda5:34ec%3(Preferred)
Default Gateway . . . . . . . . . :
DHCPv6 IAID . . . . . . . . . . . : 50331648
DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-1D-E3-DB-5E-BC-AE-C5-56-41-9B
NetBIOS over Tcpip. . . . . . . . : Disabled

On Linux ifconfig produces:

eth0      Link encap:Ethernet  HWaddr bc:ae:c5:56:46:47
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:18

eth1      Link encap:Ethernet  HWaddr bc:ae:c5:56:41:9b
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:17

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:563 errors:0 dropped:0 overruns:0 frame:0
          TX packets:563 errors:0 dropped:0 overruns:0 carrier:
          collisions:0 txqueuelen:0
          RX bytes:89930 (89.9 KB)  TX bytes:89930 (89.9 KB)

wlan0     Link encap:Ethernet  HWaddr 54:a0:50:70:ad:b5
          inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: ::c08b:adb0:dd38:e5b8/64 Scope:Global
          inet6 addr: fe80::56a0:50ff:fe70:adb5/64 Scope:Link
          inet6 addr: ::56a0:50ff:fe70:adb5/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8551 errors:0 dropped:0 overruns:0 frame:757
          TX packets:4205 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8698762 (8.6 MB)  TX bytes:608858 (608.8 KB)
          Interrupt:16 Base address:0x8000

One question I have: has anyone you are aware of been able to access more than two CUDA devices under Windows 10 Professional?

I am experiencing the same problem. I have only two GPUs: one Tesla C2075 and one GeForce GT 710. Here is the output of pgaccelinfo:


CUDA Driver Version: 9010

Device Number: 1
Device Name: GeForce GT 710
Device Revision Number: 3.5
Global Memory Size: 2147483648
Number of Multiprocessors: 1
Number of SP Cores: 192
Number of DP Cores: 64
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 954 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2505 MHz
Memory Bus Width: 64 bits
L2 Cache Size: 524288 bytes
Max Threads Per SMP: 2048
Async Engines: 1
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Compiler Option: -ta=tesla:cc35

PGI$ pgaccelinfo -dev 0

CUDA Driver Version: 9010

Notice that it doesn't report anything significant about device 0. Here is the output of my nvidia-smi.exe command:

C:\Program Files\NVIDIA Corporation\NVSMI>.\nvidia-smi
Sun Nov 11 16:42:48 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 391.35                 Driver Version: 391.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2075          TCC | 00000000:28:00.0 Off |                    0 |
| 30%   56C    P12    32W / N/A |      0MiB /  5316MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 710      WDDM | 00000000:60:00.0 N/A |                  N/A |
| N/A   48C     P8    N/A / N/A |    263MiB /  2048MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1                    Not Supported                                       |
+-----------------------------------------------------------------------------+

C:\Program Files\NVIDIA Corporation\NVSMI>.\nvidia-smi
Sun Nov 11 16:43:01 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 391.35                 Driver Version: 391.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2075          TCC | 00000000:28:00.0 Off |                    0 |
| 30%   56C    P12    31W / N/A |      0MiB /  5316MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 710      WDDM | 00000000:60:00.0 N/A |                  N/A |
| N/A   47C     P8    N/A / N/A |    263MiB /  2048MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1                    Not Supported                                       |
+-----------------------------------------------------------------------------+

C:\Program Files\NVIDIA Corporation\NVSMI>


NVIDIA recognizes both GPU controllers, but the PGI tools only see the low-end GT 710 video card I have connected to some monitors. In fact, it is the C2075 that I care about most and want to use for HPC. Any advice on what can be done to fix this?

caseroj,

I'm not too familiar with the issue, but it looks like you have the C2075 set in TCC mode (Reference Topics :: NVIDIA Nsight VSE Documentation). As far as I'm aware there aren't any limitations with TCC and PGI, but can pgaccelinfo see the card if you switch it to use WDDM? You can do so with nvidia-smi:

nvidia-smi -g {GPU_ID} -dm {0|1}

Where 0 = WDDM and 1 = TCC. Use -fdm instead of -dm to force it, though I think you should leave the GPU driving the display on WDDM.
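If you want to double-check the driver model from a program rather than from nvidia-smi, a minimal CUDA Fortran sketch along these lines might work (assuming the cudaDeviceProp type in cudafor exposes the tccDriver field, as the C runtime struct does):

program driver_model
  use cudafor
  implicit none
  integer :: ndevice, idev, istat
  type(cudaDeviceProp) :: prop

  istat = cudaGetDeviceCount(ndevice)
  print *, 'visible CUDA devices:', ndevice

  do idev = 0, ndevice - 1
    istat = cudaGetDeviceProperties(prop, idev)
    ! Assumption: tccDriver is 1 when the device is running the TCC
    ! driver and 0 when it is running the WDDM driver, as in the C API.
    if (prop%tccDriver == 1) then
      print *, idev, ' ', trim(prop%name), '  (TCC)'
    else
      print *, idev, ' ', trim(prop%name), '  (WDDM)'
    end if
  end do
end program driver_model

I believe a reboot is required after changing the driver model before the change takes effect.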