I am trying to run matrix multiplication in MATLAB 2009a using CUDA. The .cu file I wrote compiles, but it gives the wrong answer. Also, I noticed that whenever I change the block size, the answer differs as well. May I know if I have made any mistakes in my code? Thanks in advance.
matmul.cu.txt (1.69 KB)
The error comes up in this part of your code:
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
dim3 dimGrid(B.width/dimBlock.x,A.height/dimBlock.y);
With this, your matrixMul will only work if the dimension of the matrix is a multiple of BLOCK_SIZE.
For all other cases, the kernel will fail.
The following should do the trick (N, M = dimensions of the matrix):
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
dim3 dimGrid((N + dimBlock.x - 1)/dimBlock.x,(M + dimBlock.y - 1)/dimBlock.y);
I didn’t test it. Give it a try.
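One thing to keep in mind with the rounded-up grid: it launches threads past the matrix edges, so the kernel itself must guard against out-of-range rows and columns or it will read and write out of bounds. A sketch of a naive kernel with that guard (the Matrix struct and its width/height/elements fields follow the CUDA Programming Guide example this code appears to be based on, so treat the exact names as assumptions):

```cuda
// Naive matrix multiply C = A * B, row-major layout, with a bounds
// guard so the rounded-up grid does not touch memory outside C.
struct Matrix { int width; int height; float* elements; };

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard: the rounded-up grid may contain threads past the edges.
    if (row >= C.height || col >= C.width)
        return;

    float sum = 0.0f;
    for (int k = 0; k < A.width; ++k)
        sum += A.elements[row * A.width + k] * B.elements[k * B.width + col];

    C.elements[row * C.width + col] = sum;
}
```

Without the guard, a matrix whose dimensions are already multiples of BLOCK_SIZE still works, which is why the bug only shows up for other sizes.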
Thanks for your prompt reply. There are three matrices here. I presume that N and M refer to the dimensions of the resultant matrix? (Say C = A*B; N and M are the dimensions of C?)
EDIT: I tried multiplication for a 16*16 matrix (all three matrices are of the same dimension), but it still gives an incorrect answer.
You should check for errors after kernel invocation and/or cudaMemcpy:
...
size_t size=A.width*A.height*sizeof(float);
cutilSafeCall(cudaMalloc((void**)&dA.elements,size));
cutilSafeCall(cudaMemcpy(dA.elements,A.elements,size,cudaMemcpyHostToDevice));
size=B.width*B.height*sizeof(float);
cutilSafeCall(cudaMalloc((void**)&dB.elements,size));
cutilSafeCall(cudaMemcpy(dB.elements,B.elements,size,cudaMemcpyHostToDevice));
size=C.width*C.height*sizeof(float);
cutilSafeCall(cudaMalloc((void**)&dC.elements,size));
dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
dim3 dimGrid(B.width/dimBlock.x,A.height/dimBlock.y);
MatMulKernel<<<dimGrid,dimBlock>>>(dA,dB,dC);
cutilCheckMsg("Kernel execution failed");
cutilSafeCall(cudaMemcpy(C.elements,dC.elements,size,cudaMemcpyDeviceToHost));
...
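If you are not building against the cutil helpers, the same checks can be done with the plain runtime API. A minimal sketch (the macro name is mine; cutilSafeCall/cutilCheckMsg do roughly this internally):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error-check helper: wrap any runtime API call and abort
// with a readable message if it fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Kernel launches themselves return no status; the launch error is
// only visible via cudaGetLastError after the launch, e.g.:
//   MatMulKernel<<<dimGrid,dimBlock>>>(dA,dB,dC);
//   CUDA_CHECK(cudaGetLastError());
//   CUDA_CHECK(cudaThreadSynchronize()); // cudaDeviceSynchronize in newer toolkits
```

An invalid grid configuration, for example, surfaces immediately this way instead of as silently wrong results.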
Thanks for your help. I finally realized what was wrong after some testing. During input, I used matmul(A,B) instead of matmul(single(A),single(B)). It looks like the source code can only handle single precision. I am just wondering why this happens, though. I thought CUDA 2.3 supports double precision? Or is it necessary to make some changes in the source code to enable double-precision support?
EDIT: I googled this problem. Apparently -arch sm_13 needs to be added at the end of the COMPFLAGS line in nvmexopts.bat to enable double precision. But then… I added the flag, and MATLAB can't recognize it…
EDIT 2: Solution found: just add the flag to the mexopts.bat file as well.
Not able to download the attachment; running Linux.