Hi dudes and dudettes,
when using the triangular matrix-vector multiplication “cublasStrmv()” with NxN-matrices smaller than 4096 Elements, say 3072, the calculation needs more time than using matrices of much larger sizes, e.g. N=10240.
With the help of NSIGHT, I found out, that the executing trmv_kernel always uses Block-Sizes of 512 Threads, if N<4096, and much smaller Block-Sizes, i.e. 128 Threads, if the matrix is of sizes >= 4096 Elements. The occupancy for N<4096 is 25% and 100% for N>=4096
The strange thing is, that NSIGHT shows as the “Block Limit Reason:” SharedMemory, when N<4096.
With N>=4096 the Block Limit Reason is Warps, Registers, Blocks.
I googled but i found no way to tell cublas to use smaller block-sizes manually. Does anyone here know a way, or is this simply a bug in the library?
I´d like to show the informations i gathered in a table, so you can see my point more quickly:
cublasStrmv with N<4096 Elements, NSIGHT occupancy output
Occupancy: 25%
Grid Dimension: 1,1,1
Block Dimension: 512,1,1
Duration: 4,4423.073 µs
Registers per Thread: 22
Static shared Memory per Block: 16280 bytes
Cache Config Executed: PREFER_SHARED
Allocated Warps per Block: 16
Allocated Registers per Block: 12288
Allocated Shared Memory per Block: 16384
Max Block Limit Warps: 4
Max Block Limit Registers: 5
Max Block Limit Shared Memory: 3
Block Limit Reason: Shared Memory
Achieved Occupancy: 0.05
cublasStrmv with N>=4096 Elements, NSIGHT occupancy output
Occupancy: 100%
Grid Dimension: 64,1,1
Block Dimension: 128,1,1
Duration: 1,343.360
Registers per Thread: 30
Static shared Memory per Block: 2048 bytes
Cache Config Executed: PREFER_SHARED
Allocated Warps per Block: 4
Allocated Registers per Block: 4096
Allocated Shared Memory per Block: 2048
Max Block Limit Warps: 16
Max Block Limit Registers: 16
Max Block Limit Shared Memory: 24
Block Limit Reason: Warps, Registers, Blocks
Achieved Occupancy: 0.26
You can see the calculation time needed per N-Elements in the picture. https://dl.dropboxusercontent.com/u/541241/cublasStrmv_bug.jpg
I´m using a GTX660 on a Windows 7 x64 with Visual Studio 2010 and CUDA 5.0 PR. I tried compiling for Shader-Models and Compute Capabilities 1.3, 2.0 and 3.0. This error always occurs!
EDIT:
When using cublasSgemv instead of cublasStrmv, I´m gettin CUBLAS_STATUS_EXECUTION_FAILED for N<4096, and no error at all for matrices with N>=4096.
Here are the two commands executed:
cublasStrmv(this->mainHandle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT, this->size, d_M, this->size, d_new, 1)
and
cublasSgemv(this->mainHandle, CUBLAS_OP_N, this->size, this->size, &a, d_M, this->size, d_new, 1, &b, d_new, 1)
with a=1.0 and b=0.0;