Low Occupancy / Poor Performance of cublasStrmv() with N < 4096 Matrices

Hi dudes and dudettes,

When using the triangular matrix-vector multiplication cublasStrmv() with N×N matrices where N is smaller than 4096, say N = 3072, the calculation needs more time than with matrices of much larger size, e.g. N = 10240.
With the help of Nsight, I found out that the executed trmv_kernel always uses a block size of 512 threads if N < 4096, and a much smaller block size of 128 threads if N >= 4096. The occupancy is 25% for N < 4096 and 100% for N >= 4096. (That fits the launch shape: with a 1,1,1 grid there is only a single 512-thread block, i.e. 16 of the 64 warps a Kepler SM can hold, which is exactly 25%.)
The strange thing is that Nsight reports "Shared Memory" as the block limit reason when N < 4096, while with N >= 4096 the block limit reason is "Warps, Registers, Blocks".

I googled, but I found no way to tell CUBLAS to use smaller block sizes manually. Does anyone here know a way, or is this simply a bug in the library?

I'd like to show the information I gathered in a table, so you can see my point more quickly:

cublasStrmv, Nsight occupancy output

Metric                               N < 4096           N >= 4096
Occupancy                            25%                100%
Grid dimension                       1,1,1              64,1,1
Block dimension                      512,1,1            128,1,1
Duration                             44,423.073 µs      1,343.360 µs
Registers per thread                 22                 30
Static shared memory per block       16,280 bytes       2,048 bytes
Cache config executed                PREFER_SHARED      PREFER_SHARED
Allocated warps per block            16                 4
Allocated registers per block        12,288             4,096
Allocated shared memory per block    16,384 bytes       2,048 bytes
Max block limit (warps)              4                  16
Max block limit (registers)          5                  16
Max block limit (shared memory)      3                  24
Block limit reason                   Shared Memory      Warps, Registers, Blocks
Achieved occupancy                   0.05               0.26

You can see the calculation time as a function of N in this picture: https://dl.dropboxusercontent.com/u/541241/cublasStrmv_bug.jpg

I'm using a GTX 660 on Windows 7 x64 with Visual Studio 2010 and CUDA 5.0 PR. I tried compiling for shader models / compute capabilities 1.3, 2.0 and 3.0; the behaviour always occurs!
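For anyone who wants to reproduce the bulge, here is a minimal, self-contained timing scaffold along the lines of my test (names, sizes and the dummy matrix contents are mine, not from the original code). It warms up once, then times a single cublasStrmv call per size with CUDA events:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Times one cublasStrmv call of size n (after a warm-up call). */
static void time_strmv(cublasHandle_t handle, int n)
{
    float *d_M, *d_x, ms = 0.0f;
    float *h_M = (float *)malloc((size_t)n * n * sizeof(float));
    for (size_t i = 0; i < (size_t)n * n; ++i) h_M[i] = 1.0f;  /* dummy data */

    cudaMalloc((void **)&d_M, (size_t)n * n * sizeof(float));
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemcpy(d_M, h_M, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* warm-up call, then the timed call */
    cublasStrmv(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                n, d_M, n, d_x, 1);
    cudaEventRecord(start, 0);
    cublasStrmv(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                n, d_M, n, d_x, 1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("N = %5d : %9.3f ms\n", n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_M);
    cudaFree(d_x);
    free(h_M);
}

int main(void)
{
    int sizes[] = { 1024, 2048, 3072, 4095, 4096, 8192, 10240 };
    cublasHandle_t handle;
    cublasCreate(&handle);
    for (int i = 0; i < (int)(sizeof(sizes) / sizeof(sizes[0])); ++i)
        time_strmv(handle, sizes[i]);
    cublasDestroy(handle);
    return 0;
}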

EDIT:
When using cublasSgemv instead of cublasStrmv, I'm getting CUBLAS_STATUS_EXECUTION_FAILED for N < 4096, and no error at all for matrices with N >= 4096.

Here are the two commands executed:

cublasStrmv(this->mainHandle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT, this->size, d_M, this->size, d_new, 1)

and

cublasSgemv(this->mainHandle, CUBLAS_OP_N, this->size, this->size, &a, d_M, this->size, d_new, 1, &b, d_new, 1)

with a = 1.0 and b = 0.0.

Found the error in my cublasSgemv execution. However, the strange behaviour of cublasStrmv remains, while doing the N-element vector × N×N-matrix multiplication with cublasSgemv does not show the bulge for N < 4096 visible in the picture above.
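For completeness, here is a sketch of a non-aliasing variant of the cublasSgemv call above. Two assumptions that go beyond the post: the failure came from passing d_new as both the input vector x and the output vector y (BLAS routines do not permit that aliasing; the post does not say what the error actually was), and d_M stores the full matrix explicitly, since cublasSgemv, unlike cublasStrmv, reads all of it. d_tmp is a hypothetical scratch buffer:

float *d_tmp;  /* scratch buffer; allocate once, reuse across calls */
cudaMalloc((void **)&d_tmp, this->size * sizeof(float));

const float a = 1.0f, b = 0.0f;
/* keep the input vector in d_tmp, write the product into d_new */
cudaMemcpy(d_tmp, d_new, this->size * sizeof(float), cudaMemcpyDeviceToDevice);
cublasSgemv(this->mainHandle, CUBLAS_OP_N, this->size, this->size,
            &a, d_M, this->size, d_tmp, 1, &b, d_new, 1);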

So I think this is a bug in the CUBLAS library.

I updated the picture so you can compare the execution times of the two CUBLAS functions.

Specialistili

TRMV has an in-place API, which is not well suited to a parallel implementation.
So we have two kernels for this routine. For the second kernel, we allocate a temporary vector and copy the input vector to it; this way we can have an efficient implementation.

For the first kernel, we copy the input vector into shared memory.
We have a heuristic to switch from kernel1 to kernel2 depending on the size of the vector.

In the coming CUDA 5.5 release, thanks to your feedback, we changed that heuristic to trigger the second kernel earlier. The bulge should disappear.

The drawback of the second kernel is that it does a cudaMalloc/cudaFree, which might trigger an implicit cudaDeviceSynchronize().
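Putting that description together, the dispatch presumably looks roughly like the sketch below. This is not the actual CUBLAS source: the 4096 threshold is only what the Nsight numbers earlier in the thread suggest, the function name is invented, and the kernel launches are elided as comments.

#include <cuda_runtime.h>

/* Hypothetical outline of the two-path TRMV dispatch described above. */
static void strmv_two_path_sketch(int n, const float *d_A, int lda, float *d_x)
{
    (void)d_A; (void)lda;
    if (n < 4096) {
        /* kernel1: a single block stages the whole vector x in shared
         * memory and updates it in place -- hence the ~16 KB of static
         * shared memory and the 1,1,1 grid seen in the Nsight output */
    } else {
        /* kernel2: a temporary vector makes the computation out-of-place */
        float *d_tmp = NULL;
        cudaMalloc((void **)&d_tmp, n * sizeof(float));   /* implicit sync risk */
        cudaMemcpy(d_tmp, d_x, n * sizeof(float), cudaMemcpyDeviceToDevice);
        /* ... launch multi-block TRMV kernel reading d_tmp, writing d_x ... */
        cudaFree(d_tmp);                                  /* implicit sync risk */
    }
}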

Thank you for your answer.

Specialistili