Device Enumeration and cudaSetDevice: SDK Examples Failing to Run on Device 0, but Run Fine on Device 1

This is a variation of an issue discussed on other threads, but I’ve not seen a solution, and my environment is a bit different as well.

I have a system with two M2070s and an on-board (non-nvidia) video adapter. There is no X server / GDM running. There are no monitors plugged into the M2070s.

Nvidia Driver: 280.13

Cuda Toolkit: 4.0.17

SDK: 4.0.17

Distro: RHEL 6.1

If I run a simple array copy test kernel and specify cudaSetDevice(0), it segfaults. If I use cudaSetDevice(1) (or actually, any value > 1) it runs fine. That seems to indicate that cudaSetDevice is using different identifiers than what is returned by nvidia-smi -L, which gives the following (an enumeration check is sketched just after it):

GPU 0: Tesla M2070

GPU 1: Tesla M2070
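For anyone trying to reproduce the comparison, a minimal enumeration check along these lines (a sketch, not my exact test code) prints the PCI bus location for each runtime device index, so it can be lined up against nvidia-smi -L and /proc/driver/nvidia/gpus/*/information:

// enumcheck.cu - build with: nvcc enumcheck.cu -o enumcheck
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            // pciBusID / pciDeviceID match the "Bus Location" lines in /proc
            printf("runtime device %d: %s (bus %02x, device %02x)\n",
                   i, prop.name, prop.pciBusID, prop.pciDeviceID);
        }
    }
    return 0;
}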

It seems extremely non-intuitive, and not correct, that cudaSetDevice would behave this way. Additionally, code examples compiled with no GPU set (and therefore running on GPU 0), as well as those which choose and set the GPU themselves, such as the SDK matrixMul, either segfault when cudaMalloc is called or fail with “all CUDA-capable devices are busy or unavailable”, for example:

Example:

[ matrixMul ]

/usr/local/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/matrixMul Starting (CUDA and CUBLAS tests)...

Device 0: "Tesla M2070" with Compute 2.0 capability

Using Matrix Sizes: A(640 x 960), B(640 x 640), C(640 x 960)

matrixMul.cu(151) : cudaSafeCall() Runtime API error 46: all CUDA-capable devices are busy or unavailable.
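For reference, the failures happen on the very first runtime calls. A stripped-down reproduction with explicit error checking, roughly like the sketch below (the CHECK macro is my own, not the SDK's cudaSafeCall, but it does much the same thing), either segfaults inside cudaSetDevice/cudaMalloc or comes back with error 46:

// checkfirst.cu - build with: nvcc checkfirst.cu -o checkfirst
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e = (call);                                         \
        if (e != cudaSuccess) {                                         \
            fprintf(stderr, "%s failed: %s (error %d)\n",               \
                    #call, cudaGetErrorString(e), (int)e);              \
            exit(1);                                                    \
        }                                                               \
    } while (0)

int main(void)
{
    float *a_d = NULL;
    CHECK(cudaSetDevice(0));                               // segfaults here on the bad GPU
    CHECK(cudaMalloc((void **)&a_d, 14 * sizeof(float)));  // or fails/hangs here
    CHECK(cudaFree(a_d));
    printf("device 0 looks OK\n");
    return 0;
}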

Driver

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  280.13  Wed Jul 27 16:53:56 PDT 2011

GCC version:  gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC)

The devices in /proc are:

cat /proc/driver/nvidia/gpus/0/information 

Model:           Tesla M2070

IRQ:             24

Video BIOS:      ??.??.??.??.??

Card Type:       PCI-E

DMA Size:        39 bits

DMA Mask:        0x7fffffffff

Bus Location:    0000:02.00.0

cat /proc/driver/nvidia/gpus/1/information 

Model:           Tesla M2070

IRQ:             30

Video BIOS:      ??.??.??.??.??

Card Type:       PCI-E

DMA Size:        39 bits

DMA Mask:        0x7fffffffff

Bus Location:    0000:03.00.0

So, the bottom line:

If nvidia-smi and /proc show the GPUs as 0 and 1, why do I have to specify 1 or 2 in cudaSetDevice() in order to choose a valid device? (Which also means many SDK examples fail to run, since they default to device 0 when two identical devices are present.)

Any hints on how to straighten this out so that cudaSetDevice uses 0 and 1?

Thank you,

Pete

what does deviceQuery print?

Hi there,

Thanks for the help:

[deviceQuery] starting...

~NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla M2070"

  CUDA Driver Version / Runtime Version          4.0 / 4.0

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:                 5375 MBytes (5636554752 bytes)

  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores

  GPU Clock Speed:                               1.15 GHz

  Memory Clock rate:                             1566.00 Mhz

  Memory Bus Width:                              384-bit

  L2 Cache Size:                                 786432 bytes

. . . 

Device 1: "Tesla M2070"

  CUDA Driver Version / Runtime Version          4.0 / 4.0

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:                 5375 MBytes (5636554752 bytes)

  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores

  GPU Clock Speed:                               1.15 GHz

  Memory Clock rate:                             1566.00 Mhz

  Memory Bus Width:                              384-bit

  L2 Cache Size:                                 786432 bytes

. . .

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Tesla M2070, Device = Tesla M2070

[deviceQuery] test results...

PASSED

Cheers,

Pete

I decided to reinstall the devdriver_4.0_linux_64_270.41.19 driver, which works better with nvidia-smi. (With the 280.13 driver, nvidia-smi does not report results reliably.)

Set persistence mode on both M2070s: nvidia-smi -pm 1

I blacklisted nouveau, reinstalled devdriver_4.0_linux_64_270.41.19 and gpucomputingsdk_4.0.17_linux and recompiled all the examples.
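Roughly the sequence I used (filenames and paths from memory; the usual RHEL 6 locations):

# keep nouveau from grabbing the GPUs at boot
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
dracut --force                          # rebuild the initramfs, then reboot

# reinstall the driver and the SDK, then rebuild the samples
sh devdriver_4.0_linux_64_270.41.19.run
sh gpucomputingsdk_4.0.17_linux.run
cd ~/NVIDIA_GPU_Computing_SDK/C && make

# enable persistence mode on both M2070s
nvidia-smi -pm 1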

Sadly, I still can’t run most SDK examples; they just core dump.

[bandwidthTest] starting...

~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 0: Tesla M2070

 Quick Mode

Segmentation fault (core dumped)

Pete

I discovered on this thread (The Official NVIDIA Forums | NVIDIA) that you can set the environment variable CUDA_VISIBLE_DEVICES to selectively mask GPUs in a multi-GPU system. I can therefore export CUDA_VISIBLE_DEVICES=1 and all the SDK examples will run.
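From what I can tell, the value is a comma-separated list of device indices, and whatever is left visible is renumbered starting from 0 inside the CUDA runtime, e.g.:

export CUDA_VISIBLE_DEVICES=1     # only physical GPU 1 is visible, and it becomes runtime device 0
export CUDA_VISIBLE_DEVICES=0,1   # both GPUs visible, in that order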

I am starting to wonder more if there is an actual issue with one of the GPUs.

$ export CUDA_VISIBLE_DEVICES=0

$ NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest 

[bandwidthTest] starting...

NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 0: Tesla M2070

 Quick Mode

Segmentation fault (core dumped)

$ export CUDA_VISIBLE_DEVICES=1

$ NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest 

[bandwidthTest] starting...

NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 0: Tesla M2070

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     3616.5

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     3209.6

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     83883.1

[bandwidthTest] test results...

PASSED

Press ENTER to exit..

(Odd that it reports it is running on device 0, when CUDA_VISIBLE_DEVICES=1 was used?)

I find similar things happen when I run a simple kernel that copies an array to the device. When run on GPU 0 it segfaults, but not on GPU 1:

$ export CUDA_VISIBLE_DEVICES=0

$ ./arraycp-gpu0 

Segmentation fault (core dumped)

$ export CUDA_VISIBLE_DEVICES=1

$ ./arraycp-gpu0

cuda-gdb shows the segfault occurs when cudaSetDevice(0); is called (if it is explicitly called).

If cudaSetDevice is not explicitly called, cuda-gdb shows it hangs on cudaMalloc:

Core was generated by `./arraycp-gpu0'.

Program terminated with signal 11, Segmentation fault.

#0  0x00007f9557cb7903 in ?? () from /usr/lib64/libcuda.so

(cuda-gdb) list

6       #include <stdio.h>

7       #include <assert.h>

8       #include <cuda.h>

9       int main(void)

10      {

11

12         // cudaSetDevice(0);

13         float *a_h, *b_h;     // pointers to host memory

14         float *a_d, *b_d;     // pointers to device memory

15         int N = 14

(cuda-gdb) break 15

Breakpoint 1 at 0x40083d: file arraycp.cu, line 15.

(cuda-gdb) run

Starting program: ~/gpuSegFault/arraycp-gpu0 

BFD: /lib64/libc.so.6: invalid relocation type 37

BFD: BFD 2.17.50 assertion fail /home/buildmeister/build/rel/gpgpu/toolkit/r4.0/debugger/cuda-gdb/bfd/elf64-x86-64.c:259

BFD: /lib64/libc.so.6: invalid relocation type 37

BFD: BFD 2.17.50 assertion fail /home/buildmeister/build/rel/gpgpu/toolkit/r4.0/debugger/cuda-gdb/bfd/elf64-x86-64.c:259

[Thread debugging using libthread_db enabled]

[New process 4593]

[New Thread 140486779303712 (LWP 4593)]

[Switching to Thread 140486779303712 (LWP 4593)]

Breakpoint 1, main () at arraycp.cu:15

15         int N = 14;

(cuda-gdb) n

18         a_h = (float *)malloc(sizeof(float)*N);

(cuda-gdb) n

19         b_h = (float *)malloc(sizeof(float)*N);

(cuda-gdb) n

21         cudaMalloc((void **) &a_d, sizeof(float)*N);

(It hangs in gdb here forever)
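For completeness, the test program is essentially the following (reconstructed from the listing above, so details may differ slightly from my actual file):

// arraycp.cu - copies an array to the device and back
#include <stdio.h>
#include <assert.h>
#include <cuda.h>

int main(void)
{
    // cudaSetDevice(0);            // segfaults immediately when uncommented
    float *a_h, *b_h;               // pointers to host memory
    float *a_d, *b_d;               // pointers to device memory
    int N = 14;

    a_h = (float *)malloc(sizeof(float) * N);
    b_h = (float *)malloc(sizeof(float) * N);

    cudaMalloc((void **)&a_d, sizeof(float) * N);   // hangs here when run on GPU 0
    cudaMalloc((void **)&b_d, sizeof(float) * N);

    for (int i = 0; i < N; i++)
        a_h[i] = (float)i;

    // host -> device, device -> device, device -> host round trip
    cudaMemcpy(a_d, a_h, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, sizeof(float) * N, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b_h, b_d, sizeof(float) * N, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        assert(a_h[i] == b_h[i]);

    cudaFree(a_d); cudaFree(b_d);
    free(a_h); free(b_h);
    printf("array copy OK\n");
    return 0;
}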

Does anyone have any ideas what could be causing this? Faulty GPU or something wrong with my environment?

Pete

Solved!

I had been running all my debugging remotely; when I ran it on the console I saw this error:

NVRM: Xid (0000:02:00):48, An uncorrectable double bit error (DBE) has been detected on GPU0 (00 00 00).

Looking at /var/log/messages, I see this was reported during my earlier tests as well, while nothing gets sent to stderr, just the segfault message (which makes sense, since the Xid messages come from the NVIDIA kernel module). Too bad I didn’t know the module logged this information.
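In case it helps anyone else, these are the checks that would have caught this earlier (a sketch; the exact output format depends on the driver version):

# Xid messages from the kernel module land in the kernel log, not on stderr
grep Xid /var/log/messages
dmesg | grep Xid

# with ECC enabled on a Tesla, nvidia-smi -q also reports the
# single- and double-bit ECC error counts for each GPU
nvidia-smi -q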

This seemed to indicate a possible bad GPU or PCIe slot, so I swapped the two GPUs around. In doing so, I found that one of the 8-pin power harnesses, which was actually a 6+2-pin harness, did not have the 2-pin connector attached, so that GPU was slightly underpowered. At least that seems to be the case.

Since swapping the GPUs around, everything is running great.

From the driver README, Chapter 7 (Frequently Asked Questions):

“My kernel log contains messages that are prefixed with “Xid”; what do these messages mean?

“Xid” messages indicate that a general GPU error occurred, most often due to the driver misprogramming the GPU or to corruption of the commands sent to the GPU. These messages provide diagnostic information that can be used by NVIDIA to aid in debugging reported problems.”

As a note, it would be really, really great if Nvidia made available a reference on what the various Xid error codes mean.

Pete