Help with block size and block numbers

Hello all,

I am new to CUDA and am trying to write a kernel that multiplies two matrices element by element using the simple, straightforward approach. I am having trouble deciding what the block size and the number of blocks should be. I also noticed that when I change them I sometimes get wrong results: no multiplication happens and I end up with all zeros in the output matrix. Is there any basis on which we should choose these two parameters, and what effect do they have on the kernel's operation? I read in a paper that NVIDIA recommends 192 threads per block, but I don’t really get why.

Here is my kernel:

[codebox]__global__ void MatrixMultiply(float *a, float *b, float *c, int row, int col)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;

	if (i*MAX_ROW + j < MAX_ROW * MAX_COL)
		c[i*MAX_ROW + j] = a[i*MAX_ROW + j] * b[i*MAX_ROW + j];
}[/codebox]

NVIDIA actually recommends at least 192 active threads per multiprocessor, which isn’t the same thing as 192 threads per block, because more than one block can be active on a multiprocessor at once. They recommend that because it is the minimum number of threads required to cover instruction pipeline latency (the delay between writing a register and being able to read it back).

NVIDIA supplies a spreadsheet which can be used to calculate occupancy. It is very instructive to play around with it to get a feel for what affects how many blocks and threads will be running simultaneously on each multiprocessor (MP) for different execution parameters and different size kernels. The number of registers and the amount of shared memory the kernel uses also affect how many blocks will run simultaneously.
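To get a feel for the arithmetic before opening the spreadsheet, here is a rough host-side sketch of the same idea. The struct, the function name and the limit values (768 threads, 8 blocks, 8192 registers and 16 KB of shared memory per MP, which I am assuming for a compute 1.1 card) are my own illustration, and it ignores the allocation granularity the real calculator accounts for, so treat its answers as an upper bound.

```cuda
#include <algorithm>

// Per-multiprocessor limits. The values below are ASSUMED for a compute
// capability 1.1 device; look up or query the real ones for your card.
struct MPLimits {
    int maxThreads;      // resident threads per multiprocessor
    int maxBlocks;       // resident blocks per multiprocessor
    int registers;       // 32-bit registers per multiprocessor
    int sharedMemBytes;  // shared memory per multiprocessor, in bytes
};

const MPLimits kAssumedCC11 = {768, 8, 8192, 16384};

// Simplified estimate of how many blocks can be resident on one
// multiprocessor at the same time. Each resource gives its own limit and
// the smallest one wins. Allocation granularity is ignored, so the
// official calculator may report fewer blocks than this.
int activeBlocksPerMP(const MPLimits &mp, int threadsPerBlock,
                      int regsPerThread, int sharedBytesPerBlock)
{
    int byThreads = mp.maxThreads / threadsPerBlock;
    int byRegs = regsPerThread > 0
        ? mp.registers / (regsPerThread * threadsPerBlock)
        : mp.maxBlocks;
    int byShared = sharedBytesPerBlock > 0
        ? mp.sharedMemBytes / sharedBytesPerBlock
        : mp.maxBlocks;
    return std::min(std::min(byThreads, byRegs),
                    std::min(byShared, mp.maxBlocks));
}
```

With these assumed limits, a kernel using 10 registers per thread and no shared memory fits 3 blocks of 256 threads on an MP, but at 16 registers per thread only 2 blocks fit, so the number of active threads drops from 768 to 512 even though the block size did not change.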

A good “old fashioned” rule of thumb is to aim for the maximum number of simultaneous threads per MP that your card supports (either 768 or 1024 depending on which generation you have). So for a compute 1.1 card, which supports 768, you would want 128, 192 or 256 threads per block (note these are multiples of 32), which would allow 6, 4 or 3 active blocks per multiprocessor. You should try each, time the kernel, and see which is fastest.
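If it helps, timing the candidates could look something like the sketch below, which uses CUDA events. The kernel `multiplyElements` and the helper `timeLaunch` are hypothetical names of mine, and the kernel is a 1D stand-in rather than your actual code, so adapt it as needed.

```cuda
#include <cuda_runtime.h>

// Hypothetical 1D stand-in for the kernel being tuned.
__global__ void multiplyElements(const float *a, const float *b, float *c, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        c[idx] = a[idx] * b[idx];
}

// Times one launch with the given block size. dA, dB and dC are device
// pointers to n floats each. Returns the elapsed time in milliseconds,
// or a negative value if event creation or the launch fails.
float timeLaunch(const float *dA, const float *dB, float *dC,
                 int n, int threadsPerBlock)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess)
        return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Round up so that every element gets a thread.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    cudaEventRecord(start, 0);
    multiplyElements<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Call `timeLaunch` once for each of 128, 192 and 256 and compare the results. The first launch in a program usually carries some one-time start-up cost, so do a throwaway launch first and average several runs per block size before drawing conclusions.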

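You should probably change this

[codebox]if (i*MAX_ROW + j < MAX_ROW * MAX_COL)[/codebox]

to this

[codebox]if (i < MAX_ROW && j < MAX_COL)[/codebox]

The first test only bounds the flattened index, so a thread whose i or j falls outside the matrix can still pass it. Testing each index separately keeps every thread inside its own row and column.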
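Putting the pieces together, a complete version might look like the sketch below. A few things here are my own assumptions rather than anything from your code: I use the `rows` and `cols` arguments in place of `MAX_ROW` and `MAX_COL`, I assume row-major storage (so the row stride is the column count), I map `threadIdx.x` to the column so that neighbouring threads touch neighbouring addresses, and the host helper `multiplyOnDevice` is a hypothetical name. It also checks `cudaGetLastError()` after the launch, because a launch with an invalid configuration (too many threads per block, for instance) fails without running the kernel at all, which is one possible way to end up with an untouched, all-zero output.

```cuda
#include <cuda_runtime.h>

// Element-by-element product of two rows x cols matrices.
// ASSUMPTION: the matrices are stored row-major, so the row stride is cols.
__global__ void MatrixMultiply(const float *a, const float *b, float *c,
                               int rows, int cols)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index

    // Check each index on its own so threads in the padded part of the
    // grid do nothing.
    if (i < rows && j < cols)
        c[i * cols + j] = a[i * cols + j] * b[i * cols + j];
}

// Hypothetical host helper: copies the inputs to the device, launches the
// kernel with 16x16 blocks, and copies the result back into hC.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t multiplyOnDevice(const float *hA, const float *hB, float *hC,
                             int rows, int cols)
{
    size_t bytes = (size_t)rows * cols * sizeof(float);
    float *dA = 0, *dB = 0, *dC = 0;

    cudaError_t err = cudaMalloc((void **)&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        dim3 block(16, 16);  // 256 threads per block, a multiple of 32
        // Round up so the grid covers the whole matrix.
        dim3 grid((cols + block.x - 1) / block.x,
                  (rows + block.y - 1) / block.y);
        MatrixMultiply<<<grid, block>>>(dA, dB, dC, rows, cols);
        err = cudaGetLastError();  // catches an invalid launch configuration
    }

    // This copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

The grid is rounded up in both dimensions, so for matrix sizes that are not multiples of 16 some threads land outside the matrix, and the `i < rows && j < cols` test is what stops them from reading or writing past the end.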