CUBLAS performance issues


I have an iterative program that’s almost entirely CUBLAS. I’m running it on an 8800 GTS, and the performance seems somewhat lackluster. My CPU is a 2.4GHz Core Duo.

In the profiling run I set up, the operations involve a 512-element vector, a 4096-element vector, a 2048-element vector, a 4096x2048 matrix, and a 4096x4096 matrix.

The profiler results show that most CUBLAS calls take as much or more CPU time than GPU time, even when the calls run for a long time; the CPU time scales to match. Is this expected? I’m getting similarly unimpressive performance on a 64-bit Xeon with a Quadro 4600, and a Core-2 laptop with an NVS 140M.

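Here are some typical profiler results (columns: method, GPU time in usecs, CPU time in usecs, occupancy; memcopy rows list a single time):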

sgemvt_main 1846.88 1858.16 0.5
sscal_gld_main 2.21 13.98 0.33
saxpy_gld_main 2.98 14.06 0.33
sdot_gld_main 10.05 21.37 1
memcopy 2.24
scopy_gld_main 3.2 14.48 0.67
saxpy_gld_main 3.55 14.74 0.67
sasum_gld_main 10.21 21.49 1
memcopy 2.24
sgemvt_main 2464.38 2475.92 0.5
sscal_gld_main 2.21 14.05 0.33
saxpy_gld_main 2.98 14.38 0.33
sdot_gld_main 10.02 21.39 1
memcopy 2.24
scopy_gld_main 3.17 14.48 0.67
saxpy_gld_main 3.58 15.11 0.67
sasum_gld_main 10.14 21.23 1
memcopy 2.24
sgemvt_main 1885.7 1896.89 0.5
sscal_gld_main 2.24 13.1 0.33
saxpy_gld_main 2.98 14.16 0.33
sdot_gld_main 9.98 21.26 1

See how sgemvt calls actually take longer on the CPU than the GPU? Why is that? Any ideas what I could be doing wrong?


It is normal for CPU time to exceed GPU time. The CPU needs some time to pass your kernel’s input arguments to the GPU and to set up the grid and blocks, so CPU time will always be higher than GPU time, and the relative overhead will be large for short kernels.

For you, the overhead looks like about 11-12 usecs per call (CPU time minus GPU time).
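The per-call overhead can be checked directly from the posted numbers. Here is a minimal Python sketch, assuming the second column of the profiler output is GPU time and the third is CPU time, both in usecs (the thread implies this but the output has no header). The `rows` list and the `overhead_us` helper are just illustrative names, and only the first eight kernel rows are copied in.

```python
# (kernel, gpu_time_us, cpu_time_us) copied from the profiler output above.
# Assumption: column 2 is GPU time and column 3 is CPU time, in microseconds.
rows = [
    ("sgemvt_main", 1846.88, 1858.16),
    ("sscal_gld_main", 2.21, 13.98),
    ("saxpy_gld_main", 2.98, 14.06),
    ("sdot_gld_main", 10.05, 21.37),
    ("scopy_gld_main", 3.2, 14.48),
    ("saxpy_gld_main", 3.55, 14.74),
    ("sasum_gld_main", 10.21, 21.49),
    ("sgemvt_main", 2464.38, 2475.92),
]


def overhead_us(gpu_us, cpu_us):
    """Host-side overhead of one call: CPU time minus GPU time."""
    return cpu_us - gpu_us


overheads = [overhead_us(gpu, cpu) for _, gpu, cpu in rows]

for (name, gpu, cpu), extra in zip(rows, overheads):
    # Show the absolute overhead and how much of the CPU time it accounts for.
    share = 100.0 * extra / cpu
    print(f"{name:16s} overhead {extra:6.2f} us ({share:5.1f}% of CPU time)")
```

The absolute overhead is nearly the same for every call, so it dominates the short kernels (sscal, saxpy, scopy) and is negligible for sgemvt.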

So is CPU time the total time required for the function to return, or is it something else?

Also: What about those occupancy numbers? I would assume that larger is better, so is there some optimal matrix size that will improve that?

I’m seeing several posts on the forum about CUBLAS performance. Are there any known techniques to make it faster? Right now the configuration I profiled (running alone, not with the profiler) is about 25x faster than the same sequence of operations in Matlab (averaged over thousands of calls). I was expecting a bit better, but is something wrong?


CPU time is indeed measured from the start of the CUDA call until it returns, so it includes the GPU time.

Occupancy does not have to be high to achieve good performance. For your other questions I don’t know as I have never used CUBLAS.
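Since CPU time includes GPU time, summing the CPU column shows where the wall-clock time of a sequence of calls actually goes. A rough Python sketch, assuming the first seven kernel rows of the posted output form one representative pass, that column 2 is GPU time and column 3 is CPU time in usecs, and ignoring the memcopy rows (which list only one number). The names `calls`, `total_cpu`, `sgemvt_cpu` and `total_overhead` are illustrative only.

```python
# (kernel, gpu_time_us, cpu_time_us) for the first seven kernel rows posted.
# Assumption: these rows are one representative pass of the iteration.
calls = [
    ("sgemvt_main", 1846.88, 1858.16),
    ("sscal_gld_main", 2.21, 13.98),
    ("saxpy_gld_main", 2.98, 14.06),
    ("sdot_gld_main", 10.05, 21.37),
    ("scopy_gld_main", 3.2, 14.48),
    ("saxpy_gld_main", 3.55, 14.74),
    ("sasum_gld_main", 10.21, 21.49),
]

# CPU time already contains the GPU time, so the CPU column alone is the total.
total_cpu = sum(cpu for _, _, cpu in calls)

# Time spent in the matrix-vector product, and time lost to call overhead.
sgemvt_cpu = sum(cpu for name, _, cpu in calls if name.startswith("sgemvt"))
total_overhead = sum(cpu - gpu for _, gpu, cpu in calls)

print(f"total CPU time : {total_cpu:8.2f} us")
print(f"sgemvt share   : {100.0 * sgemvt_cpu / total_cpu:5.1f}%")
print(f"overhead share : {100.0 * total_overhead / total_cpu:5.1f}%")
```

Under these assumptions the sgemvt call accounts for nearly all of the time, so the launch overhead on the small BLAS-1 kernels is unlikely to be what limits the overall speedup.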