I tried to optimize my code (sparse matrix-vector multiplication) by choosing the optimal block size. I ran it with different block dimensions taken from the warp occupancy calculator (compile with the -cubin option, read off the register and shared memory usage, then run the code for every block size that gives 100% warp occupancy) and chose the block dimension with the lowest runtime. My problem is that the fastest block dimension changes when I use different matrix elements.
Is the runtime(blockdim,…,matrixelements) a function of the matrix elements? If so, how can I choose the best block dimension when the data is unknown?

The matrix dimensions are fixed when I test different block sizes. The only thing that changes is the matrix elements, which are initialized with random float numbers on every run.

Actually, it is a very simple algorithm: the sparse matrix S has non-zero elements only on the diagonal (S_ij = 0 if i != j). So all it does is multiply global_A_ij * global_S_ii, with each thread responsible for computing one multiplication.
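For reference, a minimal sketch of what such a kernel might look like. It assumes A is dense and stored row-major, and that only the diagonal of S is kept in an array. The names `scale_rows`, `s_diag` and the host wrapper `scale_rows_host` are made up for illustration and are not from the original code. Note that every thread does exactly one load of A, one load of S and one multiply, whatever the values are.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per matrix element: out[i][j] = A[i][j] * s_diag[i].
// A and out are rows x cols, row-major; s_diag holds S_ii (assumed layout).
__global__ void scale_rows(const float* A, const float* s_diag, float* out,
                           int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols) {          // guard the last, partially filled block
        int i = idx / cols;           // row index selects the diagonal entry
        out[idx] = A[idx] * s_diag[i];
    }
}

// Hypothetical host wrapper: copies data over, launches with the given
// block size, and copies the result back. Error checking omitted for brevity.
std::vector<float> scale_rows_host(const std::vector<float>& A,
                                   const std::vector<float>& s_diag,
                                   int rows, int cols, int block_size)
{
    size_t n = static_cast<size_t>(rows) * cols;
    float *dA, *dS, *dOut;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dS, rows * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dA, A.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dS, s_diag.data(), rows * sizeof(float), cudaMemcpyHostToDevice);

    int grid = static_cast<int>((n + block_size - 1) / block_size);
    scale_rows<<<grid, block_size>>>(dA, dS, dOut, rows, cols);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dS); cudaFree(dOut);
    return out;
}
```

Calling `scale_rows_host(A, s_diag, rows, cols, 128)` and again with 256 should give identical results; only the launch configuration differs.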

I don't really get what you mean by load balancing. How can I do that in CUDA? Maybe by letting one thread do more work than just one multiplication?
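On the "one thread does more work" idea: a common way to do that is a grid-stride loop, where the grid is smaller than the problem and each thread walks over several elements. The sketch below assumes the same row-major layout as the question describes; the names `scale_rows_strided` and `scale_rows_strided_host` are hypothetical. Whether this is faster than one element per thread is something to measure, not a given.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride version: each thread handles elements idx, idx + stride, ...
// so a fixed-size grid can cover a matrix of any size.
__global__ void scale_rows_strided(const float* A, const float* s_diag,
                                   float* out, int rows, int cols)
{
    int n = rows * cols;
    int stride = gridDim.x * blockDim.x;   // total number of threads launched
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n; idx += stride) {
        out[idx] = A[idx] * s_diag[idx / cols];
    }
}

// Hypothetical host wrapper with an explicit grid size, so the amount of
// work per thread can be varied. Error checking omitted for brevity.
std::vector<float> scale_rows_strided_host(const std::vector<float>& A,
                                           const std::vector<float>& s_diag,
                                           int rows, int cols,
                                           int grid_size, int block_size)
{
    size_t n = static_cast<size_t>(rows) * cols;
    float *dA, *dS, *dOut;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dS, rows * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dA, A.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dS, s_diag.data(), rows * sizeof(float), cudaMemcpyHostToDevice);

    scale_rows_strided<<<grid_size, block_size>>>(dA, dS, dOut, rows, cols);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dS); cudaFree(dOut);
    return out;
}
```

With a 16 x 16 matrix, a grid of 2 blocks and 32 threads per block, each thread processes 4 elements instead of 1. Since every element costs the same here, all threads still do equal work.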

Your algorithm shouldn’t have a load balancing issue, and performance shouldn’t depend on the elements.
Did you time the first run? That’s the most likely cause I can think of.
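To check this, something like the sketch below can help. It times several identical launches with CUDA events and returns each time separately, so the first launch can be compared against the rest. The first launch often carries one-time setup costs, which can look like a data-dependent effect if each data set is only run once. The kernel `scale` and the function `time_launches` are placeholder names, and the kernel is a stand-in for the real one.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stand-in kernel: one multiplication per thread.
__global__ void scale(const float* in, float* out, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] * factor;
}

// Launches the same kernel `launches` times and returns the elapsed time of
// each launch in milliseconds. Entry 0 is the warm-up launch.
// Error checking omitted for brevity.
std::vector<float> time_launches(int n, int block_size, int launches)
{
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int grid = (n + block_size - 1) / block_size;
    std::vector<float> ms(launches);
    for (int r = 0; r < launches; ++r) {
        cudaEventRecord(start);
        scale<<<grid, block_size>>>(d_in, d_out, 2.0f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);              // wait until the kernel is done
        cudaEventElapsedTime(&ms[r], start, stop);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ms;
}
```

When comparing block sizes, drop entry 0 and average the remaining entries for each configuration.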