transpose sample from the CUDA examples fails (Ubuntu 14.04, CUDA 7.5, NVIDIA driver 352.39)

Today I tried installing CUDA on my Dell server. Here is the information about my server and GPU:

uname -a
Linux sem-PowerEdge-T630 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22 15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

lspci | grep NVIDIA
04:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
04:00.1 Audio device: NVIDIA Corporation Device 0fb0 (rev a1)

My NVIDIA driver version is 352.39.

I installed CUDA by following this thread:
https://devtalk.nvidia.com/default/topic/878117/-solved-titan-x-for-cuda-7-5-login-loop-error-ubuntu-14-04-/

Then I tried to test the installation.

./deviceQuery

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12288 MBytes (12884705280 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX TITAN X
Result = PASS

deviceQuery runs without problems.

./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX TITAN X
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12132.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12463.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			249418.5

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

bandwidthTest worked this time, but the first few times I ran ./bandwidthTest, the screen froze at

Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12463.3

and then the server rebooted.

./transpose
Transpose Starting...

GPU Device 0: "GeForce GTX TITAN X" with compute capability 5.2

> Device 0: "GeForce GTX TITAN X"
> SM Capability 5.2 detected:
> [GeForce GTX TITAN X] has 24 MP(s) x 128 (Cores/MP) = 3072 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 229.2142 GB/s, Time = 0.03408 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transpose doesn't work properly. It stops at

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 229.2142 GB/s, Time = 0.03408 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

and then the server reboots. The server reports an error: pci 1318 fatal error.

When I run transpose under cuda-gdb, the output looks like this:

run transpose
Starting program: /home/sem/NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/transpose transpose
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Transpose Starting...

GPU Device 0: "GeForce GTX TITAN X" with compute capability 5.2

> Device 0: "GeForce GTX TITAN X"
> SM Capability 5.2 detected:
> [GeForce GTX TITAN X] has 24 MP(s) x 128 (Cores/MP) = 3072 (Cores)
> Compute performance scaling factor = 1.00
[New Thread 0x7ffff5013700 (LWP 1934)]
[New Thread 0x7fffec7ff700 (LWP 1935)]

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 126.1617 GB/s, Time = 0.06192 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 131.3269 GB/s, Time = 0.05949 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 65.3947 GB/s, Time = 0.11947 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 127.8565 GB/s, Time = 0.06110 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 132.2059 GB/s, Time = 0.05909 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 127.2712 GB/s, Time = 0.06138 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 134.0107 GB/s, Time = 0.05830 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

Connection closed by foreign host.

The screen freezes again, and then the server reboots.

I am rather confused. Any help?


Maybe your GPU is overheating. Or maybe you have not correctly hooked up aux power to your GPU. Or maybe when you installed CUDA, you did not properly remove the nouveau driver.
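For anyone following along, two of these suggestions can be checked from a terminal. The snippet below is only a sketch: `module_loaded` is a hypothetical helper written for this post, it assumes the kernel exposes the loaded modules in /proc/modules, and the temperature query only runs if `nvidia-smi` is on the PATH. It cannot tell you anything about the aux power cabling, which has to be checked by hand.

```shell
#!/bin/sh
# Sketch only: module_loaded is a hypothetical helper, not part of any NVIDIA tool.

# Succeeds if module $1 appears in a /proc/modules-style listing.
# $2 is the listing to read; it defaults to /proc/modules.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded nouveau; then
    echo "nouveau is loaded"
else
    echo "nouveau is not loaded"
fi

# Report GPU temperature (degrees C) and power draw, if nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv
fi
```

If the first check prints "nouveau is loaded", the blacklist did not take effect and the NVIDIA driver may be competing with nouveau for the card.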

@txbob Thanks for your reply!

  1. I monitored the temperature while running the test. It was about 70 degrees Celsius, which I don't think is very high.
  2. What does hooking up aux power to the GPU mean? Does it have something to do with dual power? We tried dual power, but it had no effect.
  3. I followed the link https://devtalk.nvidia.com/default/topic/878117/-solved-titan-x-for-cuda-7-5-login-loop-error-ubuntu-14-04-/ to install CUDA (you joined that discussion earlier). I did everything the same except for this step:

3) Create the /etc/modprobe.d/blacklist-nouveau.conf file with :
blacklist nouveau
option nouveau modeset=0

My version is

3) Create the /etc/modprobe.d/blacklist-nouveau.conf file with :
blacklist nouveau

because if I add

option nouveau modeset=0

there are problems when I execute

sudo update-initramfs -u

Is something wrong here?

It's possible that nouveau is still not removed.

Try changing option to options and rerunning the initrd update.

70C sounds pretty high to me for a GPU that is not doing much, but I agree that it is not too high.
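As a rough illustration of the option/options suggestion, the misspelled keyword can be spotted with grep before rerunning the initrd update. This is a sketch, not an official tool: `find_option_typo` is a hypothetical helper, and the file path is the one quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch only: find_option_typo is a hypothetical helper.

# Print (with line numbers) any line of file $1 that begins with the
# misspelled keyword "option" instead of "options".
# Exit status follows grep: 0 if such a line is found, 1 otherwise.
find_option_typo() {
    grep -n '^[[:space:]]*option[[:space:]]' "$1"
}

conf=/etc/modprobe.d/blacklist-nouveau.conf
if [ -r "$conf" ] && find_option_typo "$conf"; then
    echo "change 'option' to 'options' in $conf, then rerun: sudo update-initramfs -u"
fi
```

After the file is corrected it should contain the two lines `blacklist nouveau` and `options nouveau modeset=0`, and the initramfs has to be regenerated for the change to apply at boot.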

I tried it, but it didn't work.