I have just started the task of porting some serious code from CUDA 3.2 to 4.0, and I have come across some behaviour which leaves me scratching my head a bit. One of my linear algebra routines, which compiles, runs, and profiles fine in CUDA 3.2 and passes cuda-memcheck both with “production” build flags and inside cuda-gdb, won’t profile and fails cuda-memcheck with some weird (possibly internal) errors in 4.0rc2. The code itself seems to run fine and produces results identical to the 3.2 build, but I am always interested in seeing whether newer toolchains can find latent problems that older versions miss.
Under CUDA 3.2, a debugging build runs something like this:
(cuda-gdb) set cuda memcheck on
(cuda-gdb) run
Starting program: /home/avidday/build/felib4.0/cuda/cholf
[Thread debugging using libthread_db enabled]
[New process 7633]
N = 20000, hband = 1500, blocksize = 32
[New Thread 140018562803584 (LWP 7633)]
[Launch of CUDA Kernel 0 (gpumemset<double>) on Device 0]
[Launch of CUDA Kernel 1 (cdivkernel) on Device 0]
[Launch of CUDA Kernel 2 (cmodkernel) on Device 0]
[Launch of CUDA Kernel 3 (cdivkernel) on Device 0]
......
[Launch of CUDA Kernel 63 (cdivkernel) on Device 0]
[Launch of CUDA Kernel 64 (cmodkernel) on Device 0]
[Launch of CUDA Kernel 65 (dtrsm_r_lo_tr_main_hw_nu) on Device 0]
[Launch of CUDA Kernel 66 (syrk_kernelNT<double, false, false, 64, 16, 4, 16, 16>) on Device 0]
[Launch of CUDA Kernel 67 (uppertricopykernel0) on Device 0]
[Launch of CUDA Kernel 68 (dtrsm_r_lo_tr_main_fulltile_hw_nu) on Device 0]
[Launch of CUDA Kernel 69 (dgemm_main_hw_na_tb) on Device 0]
[Launch of CUDA Kernel 70 (dsyrk_lo_nt_main_hw_fulltile) on Device 0]
[Launch of CUDA Kernel 71 (uppertricopykernel1) on Device 0]
[Launch of CUDA Kernel 72 (cdivkernel) on Device 0]
[Launch of CUDA Kernel 73 (cmodkernel) on Device 0]
.....
[Launch of CUDA Kernel 44142 (cdivkernel) on Device 0]
[Termination of CUDA Kernel 44142 (cdivkernel) on Device 0]
[Launch of CUDA Kernel 44143 (cmodkernel) on Device 0]
[Termination of CUDA Kernel 44143 (cmodkernel) on Device 0]
gpu solution time = 926.276425
[New Thread 139646227404544 (LWP 7628)]
[New Thread 139646219011840 (LWP 7629)]
[New Thread 139646210619136 (LWP 7630)]
cpu lapack solution time = 3.358692
Program exited normally.
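For context, and as the launch trace suggests, cholf factorizes the band one column at a time inside 32-column panels (one cdivkernel plus one cmodkernel launch per column), with the blocked trsm/syrk/gemm launches in between handling the trailing updates; that is where the roughly 44,000 launches above come from. (The trsm/syrk/gemm kernels appear to be CUBLAS internals, which is why their names differ between the 3.2 and 4.0 traces.) Below is a heavily simplified sketch of just the per-column launch pattern; the kernel names follow the log, but the argument lists and bodies are illustrative placeholders, not the actual routines:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative stand-ins for the cdivkernel/cmodkernel launches in the
// log. Band storage: column j of the lower band lives at A + j*lda,
// with the diagonal entry at offset 0 and lda == hband.
__global__ void cdivkernel(double *A, int lda, int j, int hband)
{
    // Scale column j by the square root of its diagonal entry.
    double d = sqrt(A[(size_t)j * lda]);
    for (int k = threadIdx.x + 1; k < hband; k += blockDim.x)
        A[(size_t)j * lda + k] /= d;
    if (threadIdx.x == 0)
        A[(size_t)j * lda] = d;
}

__global__ void cmodkernel(double *A, int lda, int j, int N, int hband)
{
    // Rank-1 update of the columns to the right of j that fall inside
    // the band; reads only column j, writes only columns j+1 onwards.
    for (int c = 1; c < hband && j + c < N; ++c) {
        double ljc = A[(size_t)j * lda + c];
        for (int k = threadIdx.x; k < hband - c; k += blockDim.x)
            A[(size_t)(j + c) * lda + k] -= ljc * A[(size_t)j * lda + c + k];
    }
}

__global__ void initband(double *A, int lda, int N, int hband)
{
    // Fill in a diagonally dominant band so the factorization exists.
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < N) {
        A[(size_t)j * lda] = 2.0 * hband;
        for (int k = 1; k < hband; ++k)
            A[(size_t)j * lda + k] = -1.0;
    }
}

int main(void)
{
    const int N = 20000, hband = 1500;
    double *A;
    cudaMalloc((void **)&A, sizeof(double) * (size_t)hband * N);
    initband<<<(N + 127) / 128, 128>>>(A, hband, N, hband);

    // One cdiv plus one cmod launch per column: the tens of thousands
    // of small back-to-back launches visible in the cuda-gdb trace.
    for (int j = 0; j < N; ++j) {
        cdivkernel<<<1, 128>>>(A, hband, j, hband);
        cmodkernel<<<1, 128>>>(A, hband, j, N, hband);
    }
    cudaThreadSynchronize();  // still the portable spelling in 3.2/4.0
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(A);
    return 0;
}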
Under 4.0rc2, built with the same build flags:
(cuda-gdb) set cuda memcheck on
(cuda-gdb) run
Starting program: /home/avidday/build/felib4.0/cuda/cholf
[Thread debugging using libthread_db enabled]
[New process 7817]
N = 20000, hband = 1500, blocksize = 32
[New Thread 139658122073984 (LWP 7817)]
[Context Create of context 0x874610 on Device 0]
[Launch of CUDA Kernel 0 (gpumemset<double><<<(32,1,1),(32,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (cdivkernel<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (cmodkernel<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 3 (cdivkernel<<<(1,1,1),(128,1,1)>>>) on Device 0]
....
[Launch of CUDA Kernel 63 (cdivkernel<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 64 (cmodkernel<<<(1,1,1),(128,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 65 (trsm_right_kernel_val<double, 256, 4, false, false, true, false, false><<<(92,1,1),(256,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 66 (syrk_kernelNT_val<double, false, false, 64, 16, 4, 16, 16><<<(23,92,1),(16,4,1)>>>) on Device 0]
[Launch of CUDA Kernel 67 (uppertricopykernel0<<<(1,32,1),(32,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 68 (trsm_right_kernel_val<double, 256, 4, true, false, true, false, false><<<(2,1,1),(256,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 69 (gemm_kernel2x2_val<double, false, false, false, false, true><<<(2,92,1),(16,4,1)>>>) on Device 0]
[Launch of CUDA Kernel 70 (syherk_kernel_val<double, double, 256, 4, true, false, false, false, true, false><<<(2,2,1),(256,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 71 (uppertricopykernel1<<<(1,32,1),(32,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 72 (cdivkernel<<<(1,1,1),(128,1,1)>>>) on Device 0]
Error: Failed to read the grid index (dev=0, sm=0, wp=1, error=13).
The story is similar with the stand-alone version of cuda-memcheck: under 3.2 there are no problems, but under 4.0rc2 I see this:
avidday@cuda:~/build/felib4.0/cuda$ cuda-memcheck ./cholf
========= CUDA-MEMCHECK
N = 20000, hband = 1500, blocksize = 32
========= Error: process didn't terminate successfully
========= ERROR SUMMARY: 0 errors
Running the same code in the profiler with all of the profiling options selected, the first run completes, while the next repetition fails:
=== Start profiling for session 'Session1' ===
Start program '/home/avidday/build/felib4.0/cuda/cholf ' run #1 ...
N = 20000, hband = 1500, blocksize = 32
gpu solution time = 0.780827
cpu lapack solution time = 3.280818
Program run #1 completed.
Start program '/home/avidday/build/felib4.0/cuda/cholf ' run #2 ...
N = 20000, hband = 1500, blocksize = 32
unspecified launch failure choleskygpu.cuh line 204
Program run #2 failed, exit code: 252
Error in program execution.
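For reference, the “unspecified launch failure choleskygpu.cuh line 204” message comes from a perfectly ordinary per-call status check, roughly along the lines of the sketch below (CUDA_CHECK/gpuAssert are illustrative names, not the exact code). If the handler exits with the negated CUDA error code, that would also explain the exit code the profiler reports: cudaErrorLaunchFailure is 4, and -4 shows up as 252 modulo 256.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simplified sketch of the check wrapped around each runtime call;
// prints "<error string> <file> line <line>", matching the message in
// the profiler output above.
static void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "%s %s line %d\n",
                cudaGetErrorString(code), file, line);
        exit(-(int)code);  // cudaErrorLaunchFailure (4) -> exit code 252
    }
}

#define CUDA_CHECK(call) gpuAssert((call), __FILE__, __LINE__)

int main(void)
{
    // Typical use: pick up asynchronous launch failures at the next
    // synchronizing call instead of letting them propagate silently.
    // Here, deliberately trigger an error to show the message format.
    CUDA_CHECK(cudaSetDevice(99));
    return 0;
}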
I can run the code outside the profiler continuously for hours without any errors; it only seems to fail when certain kinds of external instrumentation or profiling are turned on.
Has anyone else seen anything like this? (All examples were run on 64-bit Ubuntu 10.04 LTS with the 270.40 development driver, on a compute-dedicated GTX470.)
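In case it helps anyone trying to reproduce this, the driver/runtime levels are easy to confirm with the standard version queries (both calls exist in 3.2 and 4.0):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int drv = 0, rt = 0;
    // Versions are reported as 1000*major + 10*minor, e.g. 4000 for 4.0.
    cudaDriverGetVersion(&drv);
    cudaRuntimeGetVersion(&rt);
    printf("driver version %d, runtime version %d\n", drv, rt);
    return 0;
}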