Please help: different result, emu vs non emu

Dear all CUDA developers,

I just started programming CUDA, and here I'm stuck at a very simple matrix multiplication code. The problem is, the result between emulation and device code is different. What should I do? Any help will be appreciated.

Thank you very much.

Host function:

void Matrix::MulDevice(const Matrix& A, const Matrix& B)
{
	dim3 dimBlock(A.width, B.height);
	dim3 dimGrid(1, 1);

	printf("beginning multiplication\n");

	MatrixMulKernelV1<<<dimGrid, dimBlock>>>(this, &A, &B);

	printf("end multiplication\n");
}

Kernel function:

__global__ void MatrixMulKernelV1(Matrix* R, const Matrix* A, const Matrix* B)
{
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	printf("thread %d,%d\n", tx, ty);

	float pValue = 0.0f;

	for (int k = 0; k < A->width; k++)
	{
		float AElement = MatrixGetElmt(A, k, ty);
		float BElement = MatrixGetElmt(B, tx, k);
		pValue += AElement * BElement;
	}

	MatrixSetElmt(R, tx, ty, pValue);
}


At first glance I see nothing wrong with the algorithm; however:

You probably don’t ever want to run a grid of 1x1 dimension, as it will not occupy the whole GPU, just a single shader multiprocessor. Your 8600GT has 2 of these.

If the number of threads in a block (A.width * B.height) exceeds 512, your kernel will no longer launch, as 512 is the maximum number of threads per block. Could this be the culprit?

I suggest adding CUT_CHECK_ERROR and CUDA_SAFE_CALL macros as it is done in the SDK code examples. When running in Debug mode you will see failures due to invalid arguments and other error conditions indicated clearly.
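If you don't want to pull in the SDK's cutil.h, a minimal sketch of the check those macros perform is to query the runtime for the last error after each launch (the macro name CHECK_KERNEL here is my own, not from the SDK):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of what CUT_CHECK_ERROR does under the hood: ask the CUDA
// runtime whether the previous launch (or API call) failed.
#define CHECK_KERNEL(msg)                                              \
    do {                                                               \
        cudaError_t err = cudaGetLastError();                          \
        if (err != cudaSuccess)                                        \
            fprintf(stderr, "%s failed: %s\n", msg,                    \
                    cudaGetErrorString(err));                          \
    } while (0)

// usage after a launch:
//   MatrixMulKernelV1<<<dimGrid, dimBlock>>>(...);
//   CHECK_KERNEL("MatrixMulKernelV1");
```

An oversized block (more than 512 threads) shows up here as an invalid-configuration error instead of silently doing nothing.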


PS: the next step should be to split up the output matrix into several segments, each of which is computed by one thread block. This allows you to reduce the size of a single thread block and use a larger grid. A good number of threads per block is 256.

Thank you,

Although I haven't split the matrix yet, I'm sure the block size didn't exceed 512, because the matrices I multiplied were just 4x4 elements. And while emu mode gives me the proper result, such as:

056 062 068 074
152 174 196 218
248 286 324 362
344 398 452 506

, the device mode just gives me

0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

, regardless of the input I gave.

But anyway I will try to add CUT_CHECK_ERROR and CUDA_SAFE_CALL to see the problem clearly.

Uh, did you even transfer the data to and from the GPU? Doesn’t look like it to me.

It could be that you're forgetting to copy your input matrices to the device, or results back from the device. If you try to access host memory from the device (in the kernel function) this will fail; similarly, trying to directly read device memory from the host will fail.

Host = Your PC and its RAM
Device = Your 8600GT with its 256MB or so of RAM

In emulation everything seems to work fine, because emulation does not model the difference between host and device memory. Everything just runs on the host.

You need to do some reading on the cudaMalloc and cudaMemcpy functions. There are also some simple examples in the SDK showing you how to use them.
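The usual pattern looks roughly like this; the names (h_A, d_A, run, and so on) are illustrative, not from the poster's code, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Sketch: copy host input arrays to the device, run a kernel on the
// device copies, then copy the result back to the host.
void run(const float* h_A, const float* h_B, float* h_R, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_A, *d_B, *d_R;

    cudaMalloc((void**)&d_A, bytes);   // allocate device memory
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_R, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // ... launch the multiplication kernel on d_A, d_B, d_R here ...

    cudaMemcpy(h_R, d_R, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_R);
}
```

The kernel must only ever see the d_* pointers; passing h_A (or `this`, `&A`, `&B` from host code) to a kernel hands it host addresses it cannot dereference.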


I did the data transfer in a different function, but yes, I'm sure the data transfer runs properly. I also put several breakpoints on the cudaMalloc and cudaMemcpy calls.

Otherwise, I think I would end up with 0's in both emu and device code… which is not my case…

So what exactly did you copy?

Are A and B in void Matrix::MulDevice(const Matrix& A, const Matrix& B)

located on the device (since you are passing pointers to them to the kernel)?

I’d really avoid having complex structs on the device; that is just likely to result in more complex kernels, obfuscate the code, and make it near impossible to optimize IMO, not to mention that A->width will be an uncoalesced, very slow global memory read.
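One way to follow that advice is to pass raw device pointers plus the width as a plain int. A sketch of the struct-free kernel, assuming square matrices stored row-major (the V2 name and the layout are my assumptions, not from the thread):

```cuda
// Struct-free version: raw device pointers plus an int width,
// assuming square row-major matrices and a single thread block.
__global__ void MatrixMulKernelV2(float* R, const float* A,
                                  const float* B, int width)
{
    int tx = threadIdx.x;  // column of the result element
    int ty = threadIdx.y;  // row of the result element

    float pValue = 0.0f;
    for (int k = 0; k < width; k++)
        pValue += A[ty * width + k] * B[k * width + tx];

    R[ty * width + tx] = pValue;
}
```

Since width now arrives as a kernel argument, every thread reads it from fast parameter space instead of repeatedly fetching A->width from global memory.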


Thanks for the advice. Replacing A->width with an int argument will avoid the uncoalesced read, won't it?

Reimar, I tried as you said (avoiding complex structures on the device) and it worked. Thank you very much!