I am actually a newbie to CUDA and I am trying to write a kernel that multiplies two matrices element by element using the simple straightforward approach, but I am having a problem deciding on the block size and the number of blocks. I also noticed that as I change them, I sometimes get wrong results: no multiplication happens at all and I end up with all zeros in the output matrix. So I was wondering if there is any basis on which we should choose these two parameters, and what effect they can have on the kernel's operation. I read in a paper that NVIDIA recommends 192 threads per block, but I don't really get why.
Here is my kernel:
[codebox]__global__ void MatrixMultiply(float *a, float *b, float *c, int row, int col)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // row index
    int j = blockIdx.y * blockDim.y + threadIdx.y; // column index
    if (i < row && j < col)                        // guard threads that fall outside the matrix
        c[i * col + j] = a[i * col + j] * b[i * col + j];
}
[/codebox]