I am trying to run matrix multiplication in Matlab 2009a using CUDA. The .cu file I wrote compiles, but it gives the wrong answer. I also noticed that the answer changes whenever I change the block size. Have I made a mistake in my code? Thanks in advance.

matmul.cu.txt (1.69 KB)

The error comes up in this part of your code:

```
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
dim3 dimGrid(B.width/dimBlock.x,A.height/dimBlock.y);
```

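With this, your matrixMul will only work if each dimension of the matrix is a multiple of BLOCK_SIZE.

In all other cases the integer division truncates, so the grid does not cover the whole matrix and the edge elements are never computed. If a dimension is smaller than BLOCK_SIZE, the grid size becomes 0 and the kernel launch fails.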

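The following should handle the remaining cases by rounding the grid size up (N, M = dimensions of the matrix):

```
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
dim3 dimGrid((N + dimBlock.x - 1)/dimBlock.x,(M + dimBlock.y - 1)/dimBlock.y);
```

Because the grid may now extend past the matrix edge, the kernel also has to skip threads whose row or column index is out of range.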

I didn’t test it. Give it a try.
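To make the rounding concrete, here is a minimal host-side sketch in plain C++ (it also compiles inside a .cu file). The helper names `ceilDiv`, `coveredTruncating` and `coveredCeil` are made up for this illustration and are not part of the attached code. It compares how many elements along one dimension are covered by a truncating grid size versus a rounded-up one.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks of width `block`
// needed to cover `n` elements (integer division rounded up).
unsigned int ceilDiv(unsigned int n, unsigned int block) {
    assert(block > 0);
    return (n + block - 1) / block;
}

// Elements covered along one dimension when the grid size is n / block
// (the truncating form used in the original launch configuration).
unsigned int coveredTruncating(unsigned int n, unsigned int block) {
    return (n / block) * block;
}

// Elements covered along one dimension when the grid size is rounded up.
// This can exceed n, which is why the kernel needs a bounds check.
unsigned int coveredCeil(unsigned int n, unsigned int block) {
    return ceilDiv(n, block) * block;
}
```

With a block width of 16, a dimension of 17 is covered only up to element 16 by the truncating form, while the rounded-up form launches 32 threads along that dimension, 15 of which fall outside the matrix. In the launch code this would be used as `dim3 dimGrid(ceilDiv(N, dimBlock.x), ceilDiv(M, dimBlock.y));`.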

Thanks for your prompt reply. There are three matrices here. I presume that N and M refer to the dimensions of the resulting matrix? (Say C = A*B; are N and M the dimensions of C?)

EDIT: I tried the multiplication with 16x16 matrices (all three matrices have the same dimensions), but it still gives an incorrect answer.

You should check for errors after kernel invocation and/or cudaMemcpy:

```
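...
// cutilSafeCall and cutilCheckMsg come from the CUDA SDK's cutil utility headers.

// Allocate A on the device and copy it over, checking each call
size_t size = A.width * A.height * sizeof(float);
cutilSafeCall(cudaMalloc((void**)&dA.elements, size));
cutilSafeCall(cudaMemcpy(dA.elements, A.elements, size, cudaMemcpyHostToDevice));

// Same for B
size = B.width * B.height * sizeof(float);
cutilSafeCall(cudaMalloc((void**)&dB.elements, size));
cutilSafeCall(cudaMemcpy(dB.elements, B.elements, size, cudaMemcpyHostToDevice));

// Allocate the result matrix C on the device
size = C.width * C.height * sizeof(float);
cutilSafeCall(cudaMalloc((void**)&dC.elements, size));

// Launch the kernel, then check whether the launch reported an error
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(dA, dB, dC);
cutilCheckMsg("Kernel execution failed");

// Copy the result back to the host (size still holds the size of C)
cutilSafeCall(cudaMemcpy(C.elements, dC.elements, size, cudaMemcpyDeviceToHost));
...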
```

Thanks for your help. I finally realized what was wrong after some testing. I was calling matmul(A,B) instead of matmul(single(A),single(B)). It looks like the source code can only handle single precision. I am wondering why this happens, though. I thought CUDA 2.3 supports double precision? Or do I need to change the source code to enable double precision support?

EDIT: I googled this problem. Apparently the option -arch sm_13 needs to be added at the end of the COMPFLAGS line in nvmexopts.bat to enable double precision. But when I added it, Matlab did not recognize it…

EDIT 2: Solution found: just add the same option to the mexopts.bat file as well.
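On why double input gave wrong numbers: the attachment is not shown here, so this is only a guess, but a likely cause is that the wrapper treats the MATLAB data as `float` without checking its class. MATLAB arrays are double by default, and reading their bytes as `float` produces unrelated values. The sketch below illustrates this in plain C++; the function names are hypothetical, and the values mentioned afterwards assume a little-endian machine with IEEE 754 floats.

```cpp
#include <cstring>
#include <vector>

// Hypothetical illustration: copy the raw bytes of a double array into a
// float array, mimicking code that treats a double buffer as float data.
// Each 8-byte double turns into two unrelated 4-byte floats.
std::vector<float> readDoublesAsFloats(const std::vector<double>& src) {
    std::vector<float> out(src.size() * sizeof(double) / sizeof(float));
    std::memcpy(out.data(), src.data(), src.size() * sizeof(double));
    return out;
}

// What an explicit conversion does instead: one float per double,
// with the value preserved (up to float precision).
std::vector<float> convertToFloats(const std::vector<double>& src) {
    return std::vector<float>(src.begin(), src.end());
}
```

For the input {1.0, 2.0}, the byte-wise reading yields four floats (0, 1.875, 0, 2) instead of the two intended values, which is the kind of garbage a float-only kernel would then multiply. Calling matmul(single(A),single(B)) avoids this by converting on the MATLAB side; alternatively, a MEX wrapper can check the input class with `mxIsSingle` or `mxIsDouble` before touching the data.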

I am not able to download the attachment.

I am running Linux.