Non-deterministic optimal kernel configuration


I tried to optimize my code (sparse matrix-vector multiplication) by choosing the optimal block size: following the warp occupancy calculator, I compiled with the -cubin option, read off the register and shared memory usage, ran the code for every block size that gives 100% warp occupancy, and chose the block size with the lowest runtime. My problem is that the best block size changes when I use different matrix elements.
Is the runtime, runtime(blockdim, …, matrix elements), really a function of the matrix elements? If so, how can I choose the best block size when the data is not known in advance?
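For anyone repeating these steps: the register and shared memory usage can also be printed directly by the compiler, so the cubin file does not have to be inspected by hand. A sketch (file names here are placeholders; adjust to your project):

```shell
# Print per-kernel register and shared-memory usage at compile time;
# these are the numbers the occupancy calculator spreadsheet asks for.
nvcc --ptxas-options=-v -c kernel.cu -o kernel.o

# Alternatively, produce the cubin itself, as in the workflow above:
nvcc -cubin kernel.cu
```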

Has anyone else encountered the same problem?

Thanks in advance for your help,


What’s your algorithm? Load balancing may be much more important than occupancy in some cases, and it depends on both block size and matrix shape.

The matrix dimensions are fixed while I test different block sizes. The only things that change are the matrix elements, which are initialized with random float numbers on every run.

Actually, it is a very simple algorithm: the sparse matrix S has non-zero elements only on the diagonal (S_ij = 0 if i != j). So the only thing the kernel does is multiply global_A_ij * global_S_ii, where each thread is responsible for computing one multiplication.
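To make it concrete, here is roughly what such a kernel looks like (a sketch with my own names; I assume A is a dense n x n row-major matrix in global memory and d holds the diagonal entries S_ii as a vector):

```cuda
// Sketch: out_ij = A_ij * S_ii, one thread per output element.
// A and out are n*n row-major arrays, d is the length-n diagonal of S.
__global__ void diagScaleRows(const float *A, const float *d,
                              float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n * n) {
        int row = idx / n;            // i index: picks the diagonal entry
        out[idx] = A[idx] * d[row];   // exactly one multiplication per thread
    }
}
```

Every thread does the same amount of work here, which is why the element values themselves should not matter for the runtime.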

I don't really get what you mean by load balancing. How can I do that in CUDA? Maybe by letting one thread do more work than just a single multiplication?
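One common pattern for letting a thread handle several elements is a grid-stride loop, which decouples the grid size from the problem size. A sketch (my own names; I am not claiming this will help in this particular case):

```cuda
// Grid-stride loop: each thread processes every stride-th element, so a
// fixed-size grid can cover an arbitrarily large problem, and work per
// thread grows instead of the grid.
__global__ void diagScaleRowsStrided(const float *A, const float *d,
                                     float *out, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n * n;
         idx += stride) {
        out[idx] = A[idx] * d[idx / n];
    }
}
```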

thx a lot in advance

Your algorithm shouldn't have a load balancing issue, and its performance shouldn't depend on the element values.
Did you include the first run in your timing? That's the most likely cause I can think of.
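In case it helps, this is the timing pattern I use: one warm-up launch absorbs the one-time startup costs (context creation, code upload) so they don't inflate the measurement, then CUDA events time the real launch. A self-contained sketch with a trivial placeholder kernel (all names are mine):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel, just something to time.
__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);

    // Warm-up launch: absorbs one-time initialization overhead that
    // would otherwise be charged to the first timed run.
    scale<<<grid, block>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    scale<<<grid, block>>>(d_x, 2.0f, n);   // the launch we actually measure
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```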