Please help: different result, emu vs non emu

Dear all CUDA developers,

I just started programming CUDA, and here I'm stuck at a very simple matrix multiplication code. The problem is, the result between emulation and device code is different. What should I do? Any help will be appreciated.

Thank you very much.

Host function:

void Matrix::MulDevice(const Matrix& A, const Matrix& B)
{
	dim3 dimBlock(A.width, B.height);
	dim3 dimGrid(1, 1);

	printf("beginning multiplication\n");

	MatrixMulKernelV1<<<dimGrid, dimBlock>>>(this, &A, &B);

	printf("end multiplication\n");
}

Kernel function:

__global__ void MatrixMulKernelV1(Matrix* R, const Matrix* A, const Matrix* B)
{
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	printf("thread %d,%d\n", tx, ty);

	float pValue = 0.0f;

	for (int k = 0; k < A->width; k++)
	{
		float AElement = MatrixGetElmt(A, k, ty);
		float BElement = MatrixGetElmt(B, tx, k);
		pValue += AElement * BElement;
	}

	MatrixSetElmt(R, tx, ty, pValue);
}


At first glance I see nothing wrong with the algorithm; however:

You probably don’t ever want to run a grid of 1x1 dimension, as it will not occupy the whole GPU, just a single shader multiprocessor. Your 8600GT has 2 of these.

If the number of threads in a block (A.width * B.height) exceeds 512, your kernel will no longer launch, as 512 is the maximum number of threads per block. Could this be the culprit?

I suggest adding CUT_CHECK_ERROR and CUDA_SAFE_CALL macros as it is done in the SDK code examples. When running in Debug mode you will see failures due to invalid arguments and other error conditions indicated clearly.
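If you don't want to pull in the SDK's cutil.h, a minimal sketch of the check those macros perform is to query the runtime for the last error after each launch (the macro name CHECK_KERNEL here is my own, not from the SDK):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of what CUT_CHECK_ERROR does under the hood: ask the CUDA
// runtime whether the previous launch (or API call) failed.
#define CHECK_KERNEL(msg)                                              \
    do {                                                               \
        cudaError_t err = cudaGetLastError();                          \
        if (err != cudaSuccess)                                        \
            fprintf(stderr, "%s failed: %s\n", msg,                    \
                    cudaGetErrorString(err));                          \
    } while (0)

// usage after a launch:
//   MatrixMulKernelV1<<<dimGrid, dimBlock>>>(...);
//   CHECK_KERNEL("MatrixMulKernelV1");
```

An oversized block (more than 512 threads) shows up here as an invalid-configuration error instead of silently doing nothing.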


PS: the next step should be to split up the output matrix into several segments, each of which is computed by one thread block. This allows you to reduce the size of a single thread block and use a larger grid. A good number of threads per block is 256.

Thank you,

Although I haven't split the matrix yet, I'm sure the block size didn't exceed 512, because the matrices I multiplied were just 4x4 elements. And while emu mode gives me the proper result, such as:

056 062 068 074
152 174 196 218
248 286 324 362
344 398 452 506

, the device mode just gives me

0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

, regardless of the input I gave.

But anyway I will try to add CUT_CHECK_ERROR and CUDA_SAFE_CALL to see the problem clearly.

Uh, did you even transfer the data to and from the GPU? Doesn’t look like it to me.

It could be that you're forgetting to copy your input matrices to the device, or results back from the device. If you try to access host memory from the device (in the kernel function) this will fail; similarly, trying to directly read device memory from the host will fail.

Host = Your PC and its RAM
Device = Your 8600GT with its 256MB or so of RAM

In emulation everything seems to work fine, because emulation does not model the difference between host and device memory. Everything just runs on the host.

You need to do some reading on the cudaMalloc and cudaMemcpy functions. There are also some simple examples in the SDK showing you how to use them.
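The usual pattern looks roughly like this; the names (h_A, d_A, run, and so on) are illustrative, not from the poster's code, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Sketch: copy host input arrays to the device, run a kernel on the
// device copies, then copy the result back to the host.
void run(const float* h_A, const float* h_B, float* h_R, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_A, *d_B, *d_R;

    cudaMalloc((void**)&d_A, bytes);   // allocate device memory
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_R, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // ... launch the multiplication kernel on d_A, d_B, d_R here ...

    cudaMemcpy(h_R, d_R, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_R);
}
```

The kernel must only ever see the d_* pointers; passing h_A (or `this`, `&A`, `&B` from host code) to a kernel hands it host addresses it cannot dereference.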


I did the data transfer in a different function, but yes, I'm sure the data transfer runs properly. I also put several breakpoints on the cudaMalloc and cudaMemcpy calls.

Otherwise, I think I would end up with 0's in both emu and device code… which is not my case…

So what exactly did you copy?

Are A and B in void Matrix::MulDevice(const Matrix& A, const Matrix& B)

located on the device (since you are passing pointers to them to the kernel)?

I’d really avoid having complex structs on the device; that is just likely to result in more complex kernels, obfuscate the code, and make it near impossible to optimize IMO, not to mention that A->width will be an uncoalesced, very slow global memory read.
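One way to follow that advice is to pass raw device pointers plus the width as a plain int. A sketch of the struct-free kernel, assuming square matrices stored row-major (the V2 name and the layout are my assumptions, not from the thread):

```cuda
// Struct-free version: raw device pointers plus an int width,
// assuming square row-major matrices and a single thread block.
__global__ void MatrixMulKernelV2(float* R, const float* A,
                                  const float* B, int width)
{
    int tx = threadIdx.x;  // column of the result element
    int ty = threadIdx.y;  // row of the result element

    float pValue = 0.0f;
    for (int k = 0; k < width; k++)
        pValue += A[ty * width + k] * B[k * width + tx];

    R[ty * width + tx] = pValue;
}
```

Since width now arrives as a kernel argument, every thread reads it from fast parameter space instead of repeatedly fetching A->width from global memory.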


Thanks for the advice. Replacing A->width with an int argument will avoid the uncoalesced read, won't it?

Reimar, I tried as you said (avoiding complex structures on the device) and it worked. Thank you very much!