Emulation works, Debug doesn't

Hello,
I am new to CUDA and I have something strange happening.

If I execute in EMULATION mode, the code works fine, but if I execute in DEBUG mode (on the GPU) it doesn't work, or more accurately, it doesn't give correct results. When executing in debug mode (on the GPU), the values I get back from the kernel are all 0.
I modified the code of the C++ integration example in the SDK because I had an existing C++ program to parallelize.

I need to give about 30 variables to the kernel, and get 4 back. Here is how I’m doing this…
C++ code:
runTest(Xr, Xt, Yr, Yt, Ys, Ye, Zs, a_xy, b_xy, a_xz, b_xz, Xs, d, lambda, &max_h, &max_v, cos_alfa, &isFresnel, &isLOS, percent_, resolution, sin_beta, cos_beta, a_elipse, b_elipse, x_elipse, y_elipse, topo->grid, topo->maxX, topo->maxY, topo->grid_resolution, topo->scaleC, topo->scaleR);

cppIntegration.cu:
kernel<<<1, threadsPerBlock>>>(Xr, Xt, Yr, Yt, Zs, a_xy, b_xy, a_xz, b_xz, d, lambda, cos_alfa, percent_, resolution, sin_beta, cos_beta, a_elipse, b_elipse, x_elipse, y_elipse, h_grid, grid_resolution, scaleC, scaleR, d_maxh, d_maxv, d_isFresnel, d_isLOS);

The values I need to get are d_maxh, d_maxv, d_isFresnel and d_isLOS. The code for their initialization is:
double* d_maxv;
size_t sizeMaxv = num * sizeof(double);
cudaMalloc((void**)&d_maxv, sizeMaxv);
double* h_maxv = (double*)malloc(sizeMaxv);

I don't copy initial values to the device because they are calculated on the GPU; I read the results back with:
cudaMemcpy(h_maxv, d_maxv, sizeMaxv, cudaMemcpyDeviceToHost);

Does anyone have an idea?

Thanks,
Vojdan.

I just tried this with the same settings:

[indent]

runTest(…){

size_t sizetry = num * sizeof(double);
double* h_try = (double*)malloc(sizetry);
for(int i=0; i<num; i++)
	h_try[i]=1;

for(int i=0; i<num; i++)
	printf("%f ",h_try[i]);

double* d_try;
cudaMalloc((void**)&d_try, sizetry);
cudaMemcpy(d_try, h_try, sizetry, cudaMemcpyHostToDevice);

kernel2<<<1,threadsPerBlock>>>(d_try);

cudaMemcpy(h_try, d_try, sizetry, cudaMemcpyDeviceToHost);
for(int p=0;p<num-1;p++)
	printf("\nTRY: %f",h_try[p]);

}
[/indent]

If I run this in emulation mode, the printed results are 2.0 for all the entries; if I run it in debug mode, all the results are 1.0, just as before the kernel ran.

I am working with CUDA 3.0 on Windows 7 with a GeForce 8800 GTS.

So what is threadsPerBlock?

Why aren’t you checking errors? You should be checking errors.

Threads per block is the number of threads running:

const unsigned int num_threads = num;

int threadsPerBlock = num_threads;

I was reading the Programming Guide and the Best Practices, but I couldn't quite get it :(

How do I print the result from cudaGetLastError()?

if (cudaSuccess == cudaGetLastError())

someVar = 1;

Something like this and then find out where it went wrong?

Obviously, but what is its value?

:) it’s a dynamic value, depending on the coordinates, but in this case it’s 63.

int threadsPerBlock = num_threads;

printf("NUMBER OF THREADS: %d",num_threads);

Every function returns a cudaError_t. You are probably using a host pointer instead of a device pointer at some point and your kernel isn’t actually running successfully.

Now this is totally confusing: I am doing the same thing for two arrays, and one is fine while the other isn't.

Here is the code:

[indent] //try

size_t sizetry = num*sizeof(double);

//matrix

unsigned int size_A = 10 * 10;

unsigned int mem_size_A = sizeof(float) * size_A;



//try

double* h_try = (double*)malloc(sizetry);

for(int i=0; i<num; i++) h_try[i]=1;

for(int i=0; i<num; i++) printf("\ntry before: %lf ",h_try[i]);

//matrix

float* h_A = (float*) malloc(mem_size_A);

for (int i = 0; i < size_A; ++i) h_A[i] = rand() / (float)50;

for(int p=0;p<num;p++) printf("\nmatrix before: %f",h_A[p]);

//try

double* d_try;

cutilSafeCall(cudaMalloc((void**)&d_try, sizetry));

cutilSafeCall(cudaMemcpy(d_try, h_try, sizetry, cudaMemcpyHostToDevice));

//matrix

float* d_A;

cutilSafeCall(cudaMalloc((void**) &d_A, mem_size_A));

cutilSafeCall(cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice) );

kernel2<<<1,threadsPerBlock>>>(d_try,d_A);

//try

cutilSafeCall(cudaMemcpy(h_try, d_try, sizetry, cudaMemcpyDeviceToHost));

//matrix

cutilSafeCall(cudaMemcpy(h_A, d_A, mem_size_A,  cudaMemcpyDeviceToHost) );

for(int p=0;p<num;p++)

	printf("\ntry after: %lf",h_try[p]);

for(int p=0;p<num;p++)

	printf("\nmatrix after: %lf",h_A[p]);[/indent]

and the kernel is:

[indent]__global__ void

kernel2( double* g_data, float* g_data2 )

{

	g_data[threadIdx.x]++;

	g_data2[threadIdx.x]++;

}[/indent]

The result for “try” is 1 before and after the kernel execution.

The result for “matrix” is random before and random+1 after!?!?

What am I missing?? :mellow: :mellow:

Could you post a compact, complete, compilable, runnable example which illustrates your problem? Posting random code snippets and expecting others to guess what might be going on is neither helpful nor considerate. It also makes life considerably easier for others if you post your code using the code formatting the forum software provides. Like this:

__global__ void kernel2( double* g_data, float* g_data2 )

{

	g_data[threadIdx.x]++;

	g_data2[threadIdx.x]++;

}

It's quite a big project, so I doubt I could send just a part of it, but I would gladly send you the whole project in a private message. Can I somehow do that?

While debugging, I noticed something strange. If I give a value to some specific variables at the beginning or at the end of the kernel, they get that value.

But if I give a value to the variables somewhere in the middle of the kernel, it looks like no value is assigned at all, and they keep their starting value.

This is the kernel:

__global__ void

kernel(float Xr, float Xt, float Yr, float Yt,  float Zs, float a_xy, float b_xy, float a_xz, float b_xz, float d, float lambda,  float cos_alfa, int percent_, int resolution, float sin_beta, float cos_beta, float a_elipse, float b_elipse, float x_elipse, float y_elipse,unsigned char** d_grid, float grid_resolution, float scaleC, float scaleR, float* max_h, float* max_v, bool *isFresnel, bool* isLOS)

{	

//this is OK

	max_v[threadIdx.x]=threadIdx.x;

	max_v[0]=10;

//SOME DECLARATIONS

if ((d1==0) || (d1>=d))

	{	

		isLOS[threadIdx.x]=LOS;

		isFresnel[threadIdx.x]=Fresnel;

		return;

	}

	else

	{

		float R1_fz = sqrt(lambda*d1*(d-d1)/d);

		float Zter;

		if(d_grid == 0)

			Zter = 0.0;

		else

		{

			float q = Xp;

			float w = Yp;

			q*=scaleC;

			w*=scaleR;

			int a =  (q/grid_resolution);

			int b =  (w/grid_resolution);

			Zter = d_grid[a][b];

		}

					

		float h_ter=Zter-Zp;

		float Rxy=R1_fz/cos_alfa;

		float R=Rxy*percent_/100;

			

		if(h_ter >= 0)

		{

			LOS=Fresnel=false;

			float v=h_ter*sqrt_two/Rxy;

			if(fabs(max_h[threadIdx.x]-h_ter)<0.00001)

			{

				//THIS IS THE PROBLEM

								//if these two lines are not commented out, the variables don't get new values

				max_h[threadIdx.x]=h_ter;

					max_v[7]=-10;//v;

				int dummy;

			}

			else

			{

				if (h_ter > max_h[threadIdx.x])

				{

						//THIS IS THE PROBLEM

										//if these two lines are not commented out, the variables don't get new values

					max_h[threadIdx.x]=h_ter;

					  max_v[threadIdx.x]=-10;//v;

					int dummy;

				}

			}

		}

}

}

//this is OK

	isLOS[threadIdx.x]=0;

	max_h[threadIdx.x]=2;

	max_v[threadIdx.x]=4;

	isFresnel[threadIdx.x]=1;

return;				

}

Don’t know if I got the braces right because it is a cut-out, but I hope it demonstrates what I am trying to explain.

The variables are arrays and I use them to get the values out of the kernel.

The beginning and end assignments of values are just for testing; what I actually need is the middle assignment.

And sorry for not formatting the code before :)

Thanks.

Hello guys, I still haven't fixed the problem, but I'm working hard on it.

Here is what I found out, and I don't know why I haven't checked it before.

When in DEBUG mode, the Output window says:

'cppIntegration.exe': Loaded 'C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Debug\cppIntegration.exe', Symbols loaded.

'cppIntegration.exe': Loaded 'C:\Windows\System32\ntdll.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\kernel32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\KernelBase.dll'

'cppIntegration.exe': Loaded 'C:\CUDA\bin\cudart32_30_8.dll', Binary was not built with debug information.

'cppIntegration.exe': Loaded 'C:\Windows\System32\nvcuda.dll', Binary was not built with debug information.

'cppIntegration.exe': Loaded 'C:\Windows\System32\user32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\gdi32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\lpk.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\usp10.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\msvcrt.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\advapi32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\sechost.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\rpcrt4.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\imm32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\msctf.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\nvapi.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\ole32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\oleaut32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\shlwapi.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\shell32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\setupapi.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\cfgmgr32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\devobj.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\version.dll'

First-chance exception at 0x75ae9617 in cppIntegration.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012f728. (repeated five times)

First-chance exception at 0x75ae9617 in cppIntegration.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012f758. (repeated five times)

'cppIntegration.exe': Loaded 'C:\Windows\System32\apphelp.dll'

The thread 'Win32 Thread' (0xef4) has exited with code -1073741510 (0xc000013a).

The thread 'Win32 Thread' (0xcac) has exited with code -1073741510 (0xc000013a).

The program '[504] cppIntegration.exe: Native' has exited with code -1073741510 (0xc000013a).

Can someone make sense of this?

I read somewhere that this exception might be caused by using more shared memory than is available, so I printed this:

SHARED MEMORY SIZE - MAX_V:160  MAX_H:160  LOS:40  FRESNEL:40  GRID:2500

NUMBER OF THREADS: 40

MEMORY SIZE OF ARGUMENTS: 92

TOTAL = SHARED(2900) + ARGUMENTS(3680) = 6580

The shared memory lines are arrays that are passed to all the threads and are used for cooperation between them.

The size of the arguments is multiplied by the number of threads, and the sum is a total of 6580 B.

Isn't this below the shared memory limit?

Please help.

Thanks,

Vojdan.