Low Occupancy / Poor Performance of cublasStrmv() with N < 4096 Matrices

Hi dudes and dudettes,

When using the triangular matrix-vector multiplication cublasStrmv() with N×N matrices where N is smaller than 4096, say N = 3072, the calculation needs more time than with matrices of much larger size, e.g. N = 10240.
With the help of Nsight, I found out that the executed trmv_kernel always uses a block size of 512 threads if N < 4096, and a much smaller block size of 128 threads if N >= 4096. The occupancy is 25% for N < 4096 and 100% for N >= 4096. (That fits the launch shape: with a 1,1,1 grid there is only a single 512-thread block, i.e. 16 of the 64 warps a Kepler SM can hold, which is exactly 25%.)
The strange thing is that Nsight reports "Shared Memory" as the block limit reason when N < 4096, while with N >= 4096 the block limit reason is "Warps, Registers, Blocks".

I googled, but I found no way to tell CUBLAS to use smaller block sizes manually. Does anyone here know a way, or is this simply a bug in the library?

I'd like to show the information I gathered in a table, so you can see my point more quickly:

cublasStrmv, Nsight occupancy output

Metric                               N < 4096           N >= 4096
Occupancy                            25%                100%
Grid dimension                       1,1,1              64,1,1
Block dimension                      512,1,1            128,1,1
Duration                             44,423.073 µs      1,343.360 µs
Registers per thread                 22                 30
Static shared memory per block       16,280 bytes       2,048 bytes
Cache config executed                PREFER_SHARED      PREFER_SHARED
Allocated warps per block            16                 4
Allocated registers per block        12,288             4,096
Allocated shared memory per block    16,384 bytes       2,048 bytes
Max block limit (warps)              4                  16
Max block limit (registers)          5                  16
Max block limit (shared memory)      3                  24
Block limit reason                   Shared Memory      Warps, Registers, Blocks
Achieved occupancy                   0.05               0.26

You can see the calculation time as a function of N in this picture: https://dl.dropboxusercontent.com/u/541241/cublasStrmv_bug.jpg

I'm using a GTX 660 on Windows 7 x64 with Visual Studio 2010 and CUDA 5.0 PR. I tried compiling for shader models / compute capabilities 1.3, 2.0 and 3.0; the behaviour always occurs!
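For anyone who wants to reproduce the bulge, here is a minimal, self-contained timing scaffold along the lines of my test (names, sizes and the dummy matrix contents are mine, not from the original code). It warms up once, then times a single cublasStrmv call per size with CUDA events:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Times one cublasStrmv call of size n (after a warm-up call). */
static void time_strmv(cublasHandle_t handle, int n)
{
    float *d_M, *d_x, ms = 0.0f;
    float *h_M = (float *)malloc((size_t)n * n * sizeof(float));
    for (size_t i = 0; i < (size_t)n * n; ++i) h_M[i] = 1.0f;  /* dummy data */

    cudaMalloc((void **)&d_M, (size_t)n * n * sizeof(float));
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemcpy(d_M, h_M, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* warm-up call, then the timed call */
    cublasStrmv(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                n, d_M, n, d_x, 1);
    cudaEventRecord(start, 0);
    cublasStrmv(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                n, d_M, n, d_x, 1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("N = %5d : %9.3f ms\n", n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_M);
    cudaFree(d_x);
    free(h_M);
}

int main(void)
{
    int sizes[] = { 1024, 2048, 3072, 4095, 4096, 8192, 10240 };
    cublasHandle_t handle;
    cublasCreate(&handle);
    for (int i = 0; i < (int)(sizeof(sizes) / sizeof(sizes[0])); ++i)
        time_strmv(handle, sizes[i]);
    cublasDestroy(handle);
    return 0;
}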

EDIT:
When using cublasSgemv instead of cublasStrmv, I'm getting CUBLAS_STATUS_EXECUTION_FAILED for N < 4096, and no error at all for matrices with N >= 4096.

Here are the two commands executed:

cublasStrmv(this->mainHandle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT, this->size, d_M, this->size, d_new, 1)

and

cublasSgemv(this->mainHandle, CUBLAS_OP_N, this->size, this->size, &a, d_M, this->size, d_new, 1, &b, d_new, 1)

with a = 1.0 and b = 0.0.

Found the error in my cublasSgemv execution. However, the strange behaviour of cublasStrmv remains, while doing the N-element vector × N×N-matrix multiplication with cublasSgemv does not show the bulge for N < 4096 visible in the picture above.
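For completeness, here is a sketch of a non-aliasing variant of the cublasSgemv call above. Two assumptions that go beyond the post: the failure came from passing d_new as both the input vector x and the output vector y (BLAS routines do not permit that aliasing; the post does not say what the error actually was), and d_M stores the full matrix explicitly, since cublasSgemv, unlike cublasStrmv, reads all of it. d_tmp is a hypothetical scratch buffer:

float *d_tmp;  /* scratch buffer; allocate once, reuse across calls */
cudaMalloc((void **)&d_tmp, this->size * sizeof(float));

const float a = 1.0f, b = 0.0f;
/* keep the input vector in d_tmp, write the product into d_new */
cudaMemcpy(d_tmp, d_new, this->size * sizeof(float), cudaMemcpyDeviceToDevice);
cublasSgemv(this->mainHandle, CUBLAS_OP_N, this->size, this->size,
            &a, d_M, this->size, d_tmp, 1, &b, d_new, 1);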

So I think this is a bug in the CUBLAS library.

I updated the picture so you can compare the execution times of the two CUBLAS functions.

Specialistili

TRMV has an in-place API, which is not well suited to a parallel implementation.
So we have two kernels for this routine. For the second kernel, we allocate a temporary vector and copy the input vector to it; this way we can have an efficient implementation.

For the first kernel, we copy the input vector into shared memory.
We have a heuristic to switch from kernel1 to kernel2 depending on the size of the vector.

In the coming CUDA 5.5 release, thanks to your feedback, we changed that heuristic to trigger the second kernel earlier. The bulge should disappear.

The drawback of the second kernel is that it does a cudaMalloc/cudaFree, which might trigger an implicit cudaDeviceSynchronize().
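Putting that description together, the dispatch presumably looks roughly like the sketch below. This is not the actual CUBLAS source: the 4096 threshold is only what the Nsight numbers earlier in the thread suggest, the function name is invented, and the kernel launches are elided as comments.

#include <cuda_runtime.h>

/* Hypothetical outline of the two-path TRMV dispatch described above. */
static void strmv_two_path_sketch(int n, const float *d_A, int lda, float *d_x)
{
    (void)d_A; (void)lda;
    if (n < 4096) {
        /* kernel1: a single block stages the whole vector x in shared
         * memory and updates it in place -- hence the ~16 KB of static
         * shared memory and the 1,1,1 grid seen in the Nsight output */
    } else {
        /* kernel2: a temporary vector makes the computation out-of-place */
        float *d_tmp = NULL;
        cudaMalloc((void **)&d_tmp, n * sizeof(float));   /* implicit sync risk */
        cudaMemcpy(d_tmp, d_x, n * sizeof(float), cudaMemcpyDeviceToDevice);
        /* ... launch multi-block TRMV kernel reading d_tmp, writing d_x ... */
        cudaFree(d_tmp);                                  /* implicit sync risk */
    }
}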

Thank you for your answer.

Specialistili