Please help: different result, emu vs non emu

Dear all CUDA developers,

I just started programming CUDA, and here I'm stuck at a very simple matrix multiplication code. The problem is, the result between emulation and device code is different. What should I do? Any help will be appreciated.

Thank you very much.

Host function:

void Matrix::MulDevice(const Matrix& A, const Matrix& B)
{
	dim3 dimBlock(A.width, B.height);
	dim3 dimGrid(1, 1);

	printf("beginning multiplication\n");

	MatrixMulKernelV1<<<dimGrid, dimBlock>>>(this, &A, &B);

	printf("end multiplication\n");
}

Kernel function:

__global__ void MatrixMulKernelV1(Matrix* R, const Matrix* A, const Matrix* B)
{
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	printf("thread %d,%d\n", tx, ty);

	float pValue = 0.0f;

	for (int k = 0; k < A->width; k++)
	{
		float AElement = MatrixGetElmt(A, k, ty);
		float BElement = MatrixGetElmt(B, tx, k);
		pValue += AElement * BElement;
	}

	MatrixSetElmt(R, tx, ty, pValue);
}


At first glance I see nothing wrong with the algorithm; however:

You probably don’t ever want to run a grid of 1x1 dimension, as it will not occupy the whole GPU, just a single shader multiprocessor. Your 8600GT has 2 of these.

If the number of threads in a block (A.width * B.height) exceeds 512, your kernel will no longer launch, as 512 is the maximum number of threads per block. Could this be the culprit?

I suggest adding CUT_CHECK_ERROR and CUDA_SAFE_CALL macros as it is done in the SDK code examples. When running in Debug mode you will see failures due to invalid arguments and other error conditions indicated clearly.
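If you don't want to pull in the SDK's cutil.h, a minimal sketch of the check those macros perform is to query the runtime for the last error after each launch (the macro name CHECK_KERNEL here is my own, not from the SDK):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of what CUT_CHECK_ERROR does under the hood: ask the CUDA
// runtime whether the previous launch (or API call) failed.
#define CHECK_KERNEL(msg)                                              \
    do {                                                               \
        cudaError_t err = cudaGetLastError();                          \
        if (err != cudaSuccess)                                        \
            fprintf(stderr, "%s failed: %s\n", msg,                    \
                    cudaGetErrorString(err));                          \
    } while (0)

// usage after a launch:
//   MatrixMulKernelV1<<<dimGrid, dimBlock>>>(...);
//   CHECK_KERNEL("MatrixMulKernelV1");
```

An oversized block (more than 512 threads) shows up here as an invalid-configuration error instead of silently doing nothing.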


PS: the next step should be to split up the output matrix into several segments, each of which is computed by one thread block. This allows you to reduce the size of a single thread block and use a larger grid. A good number of threads per block is 256.

Thank you,

Although I haven't split the matrix yet, I'm sure the block size didn't exceed 512, because the matrices I multiplied were just 4x4 elements. And while emu mode gives me the proper result, such as:

056 062 068 074
152 174 196 218
248 286 324 362
344 398 452 506

, the device mode just gives me

0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

, regardless of the input I gave.

But anyway I will try to add CUT_CHECK_ERROR and CUDA_SAFE_CALL to see the problem clearly.

Uh, did you even transfer the data to and from the GPU? Doesn’t look like it to me.

It could be that you're forgetting to copy your input matrices to the device, or results back from the device. If you try to access host memory from the device (in the kernel function) this will fail; similarly, trying to directly read device memory from the host will fail.

Host = Your PC and its RAM
Device = Your 8600GT with its 256MB or so of RAM

In emulation everything seems to work fine, because emulation does not model the difference between host and device memory. Everything just runs on the host.

You need to do some reading on the cudaMalloc and cudaMemcpy functions. There are also some simple examples in the SDK showing you how to use them.
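The usual pattern looks roughly like this; the names (h_A, d_A, run, and so on) are illustrative, not from the poster's code, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Sketch: copy host input arrays to the device, run a kernel on the
// device copies, then copy the result back to the host.
void run(const float* h_A, const float* h_B, float* h_R, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_A, *d_B, *d_R;

    cudaMalloc((void**)&d_A, bytes);   // allocate device memory
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_R, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // ... launch the multiplication kernel on d_A, d_B, d_R here ...

    cudaMemcpy(h_R, d_R, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_R);
}
```

The kernel must only ever see the d_* pointers; passing h_A (or `this`, `&A`, `&B` from host code) to a kernel hands it host addresses it cannot dereference.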


I did the data transfer in a different function, but yes, I'm sure the data transfer runs properly. I also put several breakpoints on the cudaMalloc and cudaMemcpy calls.

Otherwise, I think I would end up with 0's in both emu and device code… which is not my case…

So what exactly did you copy?

Are A and B in void Matrix::MulDevice(const Matrix& A, const Matrix& B)

located on the device (since you are passing pointers to them to the kernel)?

I’d really avoid having complex structs on the device; that is just likely to result in more complex kernels, obfuscate the code, and make it near impossible to optimize IMO, not to mention that A->width will be an uncoalesced, very slow global memory read.
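One way to follow that advice is to pass raw device pointers plus the width as a plain int. A sketch of the struct-free kernel, assuming square matrices stored row-major (the V2 name and the layout are my assumptions, not from the thread):

```cuda
// Struct-free version: raw device pointers plus an int width,
// assuming square row-major matrices and a single thread block.
__global__ void MatrixMulKernelV2(float* R, const float* A,
                                  const float* B, int width)
{
    int tx = threadIdx.x;  // column of the result element
    int ty = threadIdx.y;  // row of the result element

    float pValue = 0.0f;
    for (int k = 0; k < width; k++)
        pValue += A[ty * width + k] * B[k * width + tx];

    R[ty * width + tx] = pValue;
}
```

Since width now arrives as a kernel argument, every thread reads it from fast parameter space instead of repeatedly fetching A->width from global memory.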


Thanks for the advice. Replacing A->width with an int argument will avoid the uncoalesced read, won't it?

Reimar, I tried as you said (avoiding complex structures on the device) and it worked. Thank you very much!