Hello all,

I am a newbie to CUDA, and I am trying to write a kernel that multiplies two matrices element by element using the simple, straightforward approach. I am having trouble deciding on the block size and the number of blocks. I also noticed that as I change them, I sometimes get wrong results — no multiplication seems to happen and I end up with all zeros in the output matrix. Is there any basis on which we should choose these two parameters, and what effect do they have on the kernel's operation? I read in a paper that NVIDIA recommends 192 threads per block, but I don't really get why.

Here is my kernel:

[codebox]__global__ void MatrixMultiply(float *a, float *b, float *c, int row, int col)

{

```
int i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
// Guard each dimension separately. A single flattened check such as
// (i*MAX_ROW + j < MAX_ROW*MAX_COL) still lets out-of-range threads
// alias valid elements when the grid is larger than the matrix.
if (i < col && j < row)
    c[j * col + i] = a[j * col + i] * b[j * col + i];
```

}[/codebox]
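For what it's worth, here is a minimal host-side sketch of how a launch configuration for a kernel like this is typically computed — the 16x16 block size and the matrix dimensions are just assumptions for illustration, not anything from the original post. The key idea is the ceiling division, which guarantees enough blocks to cover every element even when the matrix size is not a multiple of the block size (which is exactly why the kernel needs the per-dimension bounds check):

```
#include <cuda_runtime.h>

int main(void)
{
    const int row = 1000, col = 1000;          // example sizes (assumed)
    const size_t bytes = (size_t)row * col * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    // ... copy input data into d_a and d_b with cudaMemcpy ...

    // 16x16 = 256 threads per block: a multiple of the warp size (32),
    // which is the main hard requirement for efficiency.
    dim3 block(16, 16);

    // Ceiling division: enough blocks to cover col columns and row rows,
    // e.g. (1000 + 15) / 16 = 63 blocks in each dimension (63*16 = 1008,
    // so the last partial blocks rely on the kernel's bounds check).
    dim3 grid((col + block.x - 1) / block.x,
              (row + block.y - 1) / block.y);

    MatrixMultiply<<<grid, block>>>(d_a, d_b, d_c, row, col);
    cudaDeviceSynchronize();

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```

If the grid undershoots the matrix (too few blocks), the uncovered elements are simply never written, which would leave zeros in the output exactly as described. Figures like "192 threads per block" come from older NVIDIA guidance about having enough threads to hide register read-after-write latency; the firm rules are only that the block size be a multiple of 32 and not exceed the hardware limit (512 on older GPUs, 1024 on newer ones).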

Thanks