Device Enumeration and cudaSetDevice: SDK Examples Failing to Run on Device 0, but Run Fine on Device 1

This is a variation of an issue discussed on other threads, but I’ve not seen a solution, and my environment is a bit different as well.

I have a system with two M2070s and an on-board (non-nvidia) video adapter. There is no X server / GDM running. There are no monitors plugged into the M2070s.

Nvidia Driver: 280.13

Cuda Toolkit: 4.0.17

SDK: 4.0.17

Distro: RHEL 6.1

If I run a simple array copy test kernel and specify cudaSetDevice(0), it segfaults. If I use cudaSetDevice(1) (or actually, any value > 1) it runs fine. That seems to indicate that cudaSetDevice is using different identifiers than what is returned by nvidia-smi -L, which gives the following (an enumeration check is sketched just after it):

GPU 0: Tesla M2070

GPU 1: Tesla M2070
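For anyone trying to reproduce the comparison, a minimal enumeration check along these lines (a sketch, not my exact test code) prints the PCI bus location for each runtime device index, so it can be lined up against nvidia-smi -L and /proc/driver/nvidia/gpus/*/information:

// enumcheck.cu - build with: nvcc enumcheck.cu -o enumcheck
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
            // pciBusID / pciDeviceID match the "Bus Location" lines in /proc
            printf("runtime device %d: %s (bus %02x, device %02x)\n",
                   i, prop.name, prop.pciBusID, prop.pciDeviceID);
        }
    }
    return 0;
}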

It seems extremely non-intuitive, and not correct, that cudaSetDevice would behave this way. Additionally, code examples compiled with no GPU set (and therefore running on GPU 0), as well as those which choose and set the GPU themselves, such as the SDK matrixMul, either segfault when cudaMalloc is called or fail with “all CUDA-capable devices are busy or unavailable”, for example:

Example:

[ matrixMul ]

/usr/local/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/matrixMul Starting (CUDA and CUBLAS tests)...

Device 0: "Tesla M2070" with Compute 2.0 capability

Using Matrix Sizes: A(640 x 960), B(640 x 640), C(640 x 960)

matrixMul.cu(151) : cudaSafeCall() Runtime API error 46: all CUDA-capable devices are busy or unavailable.
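For reference, the failures happen on the very first runtime calls. A stripped-down reproduction with explicit error checking, roughly like the sketch below (the CHECK macro is my own, not the SDK's cudaSafeCall, but it does much the same thing), either segfaults inside cudaSetDevice/cudaMalloc or comes back with error 46:

// checkfirst.cu - build with: nvcc checkfirst.cu -o checkfirst
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e = (call);                                         \
        if (e != cudaSuccess) {                                         \
            fprintf(stderr, "%s failed: %s (error %d)\n",               \
                    #call, cudaGetErrorString(e), (int)e);              \
            exit(1);                                                    \
        }                                                               \
    } while (0)

int main(void)
{
    float *a_d = NULL;
    CHECK(cudaSetDevice(0));                               // segfaults here on the bad GPU
    CHECK(cudaMalloc((void **)&a_d, 14 * sizeof(float)));  // or fails/hangs here
    CHECK(cudaFree(a_d));
    printf("device 0 looks OK\n");
    return 0;
}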

Driver

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  280.13  Wed Jul 27 16:53:56 PDT 2011

GCC version:  gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC)

The devices in /proc are:

cat /proc/driver/nvidia/gpus/0/information 

Model:           Tesla M2070

IRQ:             24

Video BIOS:      ??.??.??.??.??

Card Type:       PCI-E

DMA Size:        39 bits

DMA Mask:        0x7fffffffff

Bus Location:    0000:02.00.0

cat /proc/driver/nvidia/gpus/1/information 

Model:           Tesla M2070

IRQ:             30

Video BIOS:      ??.??.??.??.??

Card Type:       PCI-E

DMA Size:        39 bits

DMA Mask:        0x7fffffffff

Bus Location:    0000:03.00.0

So, the bottom line:

If nvidia-smi and /proc show the GPUs as 0 and 1, why do I have to specify 1 or 2 in cudaSetDevice() in order to choose a valid device? (Which also means many SDK examples fail to run, since they default to device 0 when two identical devices are present.)

Any hints on how to straighten this out so that cudaSetDevice uses 0 and 1?

Thank you,

Pete

what does deviceQuery print?

Hi there,

Thanks for the help:

[deviceQuery] starting...

~NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 2 CUDA Capable device(s)

Device 0: "Tesla M2070"

  CUDA Driver Version / Runtime Version          4.0 / 4.0

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:                 5375 MBytes (5636554752 bytes)

  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores

  GPU Clock Speed:                               1.15 GHz

  Memory Clock rate:                             1566.00 Mhz

  Memory Bus Width:                              384-bit

  L2 Cache Size:                                 786432 bytes

. . . 

Device 1: "Tesla M2070"

  CUDA Driver Version / Runtime Version          4.0 / 4.0

  CUDA Capability Major/Minor version number:    2.0

  Total amount of global memory:                 5375 MBytes (5636554752 bytes)

  (14) Multiprocessors x (32) CUDA Cores/MP:     448 CUDA Cores

  GPU Clock Speed:                               1.15 GHz

  Memory Clock rate:                             1566.00 Mhz

  Memory Bus Width:                              384-bit

  L2 Cache Size:                                 786432 bytes

. . .

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 2, Device = Tesla M2070, Device = Tesla M2070

[deviceQuery] test results...

PASSED

Cheers,

Pete

I decided to reinstall the devdriver_4.0_linux_64_270.41.19 driver, which works better with nvidia-smi. (With the 280.13 driver, nvidia-smi does not report results reliably.)

Set persistence mode on both M2070s: nvidia-smi -pm 1

I blacklisted nouveau, reinstalled devdriver_4.0_linux_64_270.41.19 and gpucomputingsdk_4.0.17_linux and recompiled all the examples.
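Roughly the sequence I used (filenames and paths from memory; the usual RHEL 6 locations):

# keep nouveau from grabbing the GPUs at boot
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
dracut --force                          # rebuild the initramfs, then reboot

# reinstall the driver and the SDK, then rebuild the samples
sh devdriver_4.0_linux_64_270.41.19.run
sh gpucomputingsdk_4.0.17_linux.run
cd ~/NVIDIA_GPU_Computing_SDK/C && make

# enable persistence mode on both M2070s
nvidia-smi -pm 1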

Sadly, I still can’t run most SDK examples; they just core dump.

[bandwidthTest] starting...

~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 0: Tesla M2070

 Quick Mode

Segmentation fault (core dumped)

Pete

I discovered on this thread (The Official NVIDIA Forums | NVIDIA) that you can set the environment variable CUDA_VISIBLE_DEVICES to selectively mask GPUs in a multi-GPU system. I can therefore export CUDA_VISIBLE_DEVICES=1 and all the SDK examples will run.
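From what I can tell, the value is a comma-separated list of device indices, and whatever is left visible is renumbered starting from 0 inside the CUDA runtime, e.g.:

export CUDA_VISIBLE_DEVICES=1     # only physical GPU 1 is visible, and it becomes runtime device 0
export CUDA_VISIBLE_DEVICES=0,1   # both GPUs visible, in that order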

I am starting to wonder more if there is an actual issue with one of the GPUs.

$ export CUDA_VISIBLE_DEVICES=0

$ NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest 

[bandwidthTest] starting...

NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 0: Tesla M2070

 Quick Mode

Segmentation fault (core dumped)

$ export CUDA_VISIBLE_DEVICES=1

$ NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest 

[bandwidthTest] starting...

NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest Starting...

Running on...

Device 0: Tesla M2070

 Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     3616.5

Device to Host Bandwidth, 1 Device(s), Paged memory

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     3209.6

Device to Device Bandwidth, 1 Device(s)

   Transfer Size (Bytes)        Bandwidth(MB/s)

   33554432                     83883.1

[bandwidthTest] test results...

PASSED

Press ENTER to exit..

(Odd that it reports it is running on device 0, when CUDA_VISIBLE_DEVICES=1 was used?)

I find similar things happen when I run a simple kernel that copies an array to the device. When run on GPU 0 it segfaults, but not on GPU 1:

$ export CUDA_VISIBLE_DEVICES=0

$ ./arraycp-gpu0 

Segmentation fault (core dumped)

$ export CUDA_VISIBLE_DEVICES=1

$ ./arraycp-gpu0

cuda-gdb shows the segfault occurs when cudaSetDevice(0); is called (if it is explicitly called).

If cudaSetDevice is not explicitly called, cuda-gdb shows it hangs on cudaMalloc:

Core was generated by `./arraycp-gpu0'.

Program terminated with signal 11, Segmentation fault.

#0  0x00007f9557cb7903 in ?? () from /usr/lib64/libcuda.so

(cuda-gdb) list

6       #include <stdio.h>

7       #include <assert.h>

8       #include <cuda.h>

9       int main(void)

10      {

11

12         // cudaSetDevice(0);

13         float *a_h, *b_h;     // pointers to host memory

14         float *a_d, *b_d;     // pointers to device memory

15         int N = 14

(cuda-gdb) break 15

Breakpoint 1 at 0x40083d: file arraycp.cu, line 15.

(cuda-gdb) run

Starting program: ~/gpuSegFault/arraycp-gpu0 

BFD: /lib64/libc.so.6: invalid relocation type 37

BFD: BFD 2.17.50 assertion fail /home/buildmeister/build/rel/gpgpu/toolkit/r4.0/debugger/cuda-gdb/bfd/elf64-x86-64.c:259

BFD: /lib64/libc.so.6: invalid relocation type 37

BFD: BFD 2.17.50 assertion fail /home/buildmeister/build/rel/gpgpu/toolkit/r4.0/debugger/cuda-gdb/bfd/elf64-x86-64.c:259

[Thread debugging using libthread_db enabled]

[New process 4593]

[New Thread 140486779303712 (LWP 4593)]

[Switching to Thread 140486779303712 (LWP 4593)]

Breakpoint 1, main () at arraycp.cu:15

15         int N = 14;

(cuda-gdb) n

18         a_h = (float *)malloc(sizeof(float)*N);

(cuda-gdb) n

19         b_h = (float *)malloc(sizeof(float)*N);

(cuda-gdb) n

21         cudaMalloc((void **) &a_d, sizeof(float)*N);

(It hangs in gdb here forever)
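For completeness, the test program is essentially the following (reconstructed from the listing above, so details may differ slightly from my actual file):

// arraycp.cu - copies an array to the device and back
#include <stdio.h>
#include <assert.h>
#include <cuda.h>

int main(void)
{
    // cudaSetDevice(0);            // segfaults immediately when uncommented
    float *a_h, *b_h;               // pointers to host memory
    float *a_d, *b_d;               // pointers to device memory
    int N = 14;

    a_h = (float *)malloc(sizeof(float) * N);
    b_h = (float *)malloc(sizeof(float) * N);

    cudaMalloc((void **)&a_d, sizeof(float) * N);   // hangs here when run on GPU 0
    cudaMalloc((void **)&b_d, sizeof(float) * N);

    for (int i = 0; i < N; i++)
        a_h[i] = (float)i;

    // host -> device, device -> device, device -> host round trip
    cudaMemcpy(a_d, a_h, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, sizeof(float) * N, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b_h, b_d, sizeof(float) * N, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
        assert(a_h[i] == b_h[i]);

    cudaFree(a_d); cudaFree(b_d);
    free(a_h); free(b_h);
    printf("array copy OK\n");
    return 0;
}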

Does anyone have any ideas what could be causing this? Faulty GPU or something wrong with my environment?

Pete

Solved!

I had been running all my debugging remotely; when I ran it on the console I saw this error:

NVRM: Xid (0000:02:00):48, An uncorrectable double bit error (DBE) has been detected on GPU0 (00 00 00).

Looking at /var/log/messages, I see this was reported during my earlier tests as well, while nothing gets sent to stderr, just the segfault message (which makes sense, since the Xid messages come from the NVIDIA kernel module). Too bad I didn’t know the module logged this information.
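In case it helps anyone else, these are the checks that would have caught this earlier (a sketch; the exact output format depends on the driver version):

# Xid messages from the kernel module land in the kernel log, not on stderr
grep Xid /var/log/messages
dmesg | grep Xid

# with ECC enabled on a Tesla, nvidia-smi -q also reports the
# single- and double-bit ECC error counts for each GPU
nvidia-smi -q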

This seemed to indicate a possible bad GPU or PCIe slot, so I swapped the two GPUs around. In doing so, I found that one of the 8-pin power harnesses, which was actually a 6+2-pin harness, did not have the 2-pin connector attached, so that GPU was slightly underpowered. At least that seems to be the case.

Since swapping the GPUs around, everything is running great.

From the driver README, Chapter 7 (Frequently Asked Questions):

“My kernel log contains messages that are prefixed with “Xid”; what do these messages mean?

“Xid” messages indicate that a general GPU error occurred, most often due to the driver misprogramming the GPU or to corruption of the commands sent to the GPU. These messages provide diagnostic information that can be used by NVIDIA to aid in debugging reported problems.”

As a note, it would be really, really great if Nvidia made available a reference on what the various Xid error codes mean.

Pete