Emulation works, Debug doesn't

Hello,
I am new to CUDA and I have something strange happening.

If I execute in EMULATION mode, the code works fine, but if I execute in DEBUG mode (on the GPU) it doesn't work, or more accurately, it doesn't give correct results. When executing in debug mode (on the GPU), the values I get back from the kernel are all 0.
I modified the code of the C++ integration example in the SDK because I had an existing C++ program to parallelize.

I need to give about 30 variables to the kernel, and get 4 back. Here is how I’m doing this…
C++ code:
runTest(Xr, Xt, Yr, Yt, Ys, Ye, Zs, a_xy, b_xy, a_xz, b_xz, Xs, d, lambda, &max_h, &max_v, cos_alfa, &isFresnel, &isLOS, percent_, resolution, sin_beta, cos_beta, a_elipse, b_elipse, x_elipse, y_elipse, topo->grid, topo->maxX, topo->maxY, topo->grid_resolution, topo->scaleC, topo->scaleR);

cppIntegration.cu:
kernel<<<1, threadsPerBlock>>>(Xr, Xt, Yr, Yt, Zs, a_xy, b_xy, a_xz, b_xz, d, lambda, cos_alfa, percent_, resolution, sin_beta, cos_beta, a_elipse, b_elipse, x_elipse, y_elipse, h_grid, grid_resolution, scaleC, scaleR, d_maxh, d_maxv, d_isFresnel, d_isLOS);

The values I need to get are d_maxh, d_maxv, d_isFresnel and d_isLOS. The code for their initialization is:
double* d_maxv;
size_t sizeMaxv = num * sizeof(double);
cudaMalloc((void**)&d_maxv, sizeMaxv);
double* h_maxv = (double*)malloc(sizeMaxv);

I don't copy initial values to the device because they are calculated on the GPU; I read the results back with:
cudaMemcpy(h_maxv, d_maxv, sizeMaxv, cudaMemcpyDeviceToHost);

Does anyone have an idea?

Thanks,
Vojdan.

I just tried this with the same settings:

[indent]

runTest(…){

size_t sizetry = num * sizeof(double);
double* h_try = (double*)malloc(sizetry);
for(int i=0; i<num; i++)
	h_try[i]=1;

for(int i=0; i<num; i++)
	printf("%f ",h_try[i]);

double* d_try;
cudaMalloc((void**)&d_try, sizetry);
cudaMemcpy(d_try, h_try, sizetry, cudaMemcpyHostToDevice);

kernel2<<<1,threadsPerBlock>>>(d_try);

cudaMemcpy(h_try, d_try, sizetry, cudaMemcpyDeviceToHost);
for(int p=0;p<num-1;p++)
	printf("\nTRY: %f",h_try[p]);

}
[/indent]

If I run this in emulation mode, the printed results are 2.0 for all the entries; if I run it in debug mode, all the results are 1.0, just as before the kernel ran.

I am working with CUDA 3.0 on Windows 7 with a GeForce 8800 GTS.

So what is threadsPerBlock?

Why aren’t you checking errors? You should be checking errors.

Threads per block is the number of threads running:

const unsigned int num_threads = num;

int threadsPerBlock = num_threads;

I was reading the Programming Guide and the Best Practices, but I couldn't quite get it :(

How do I print the result from cudaGetLastError()?

if (cudaSuccess == cudaGetLastError())

someVar = 1;

Something like this and then find out where it went wrong?

Obviously, but what is its value?

:) it’s a dynamic value, depending on the coordinates, but in this case it’s 63.

int threadsPerBlock = num_threads;

printf("NUMBER OF THREADS: %d",num_threads);

Every function returns a cudaError_t. You are probably using a host pointer instead of a device pointer at some point and your kernel isn’t actually running successfully.

Now this is totally confusing: I am doing the same thing for two arrays, and one is fine while the other isn't.

Here is the code:

[indent] //try

size_t sizetry = num*sizeof(double);

//matrix

unsigned int size_A = 10 * 10;

unsigned int mem_size_A = sizeof(float) * size_A;



//try

double* h_try = (double*)malloc(sizetry);

for(int i=0; i<num; i++) h_try[i]=1;

for(int i=0; i<num; i++) printf("\ntry before: %lf ",h_try[i]);

//matrix

float* h_A = (float*) malloc(mem_size_A);

for (int i = 0; i < size_A; ++i) h_A[i] = rand() / (float)50;

for(int p=0;p<num;p++) printf("\nmatrix before: %f",h_A[p]);

//try

double* d_try;

cutilSafeCall(cudaMalloc((void**)&d_try, sizetry));

cutilSafeCall(cudaMemcpy(d_try, h_try, sizetry, cudaMemcpyHostToDevice));

//matrix

float* d_A;

cutilSafeCall(cudaMalloc((void**) &d_A, mem_size_A));

cutilSafeCall(cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice) );

kernel2<<<1,threadsPerBlock>>>(d_try,d_A);

//try

cutilSafeCall(cudaMemcpy(h_try, d_try, sizetry, cudaMemcpyDeviceToHost));

//matrix

cutilSafeCall(cudaMemcpy(h_A, d_A, mem_size_A,  cudaMemcpyDeviceToHost) );

for(int p=0;p<num;p++)

	printf("\ntry after: %lf",h_try[p]);

for(int p=0;p<num;p++)

	printf("\nmatrix after: %lf",h_A[p]);[/indent]

and the kernel is:

[indent]__global__ void

kernel2( double* g_data, float* g_data2 )

{

	g_data[threadIdx.x]++;

	g_data2[threadIdx.x]++;

}[/indent]

The result for “try” is 1 before and after the kernel execution.

The result for “matrix” is random before and random+1 after!?!?

What am I missing?? :mellow: :mellow:

Could you post a compact, complete, compilable, runnable example which illustrates your problem? Posting random code snippets and expecting others to guess what might be going on is neither helpful nor considerate. It also makes life considerably easier for others if you post your code using the code formatting the forum software provides. Like this:

__global__ void kernel2( double* g_data, float* g_data2 )

{

	g_data[threadIdx.x]++;

	g_data2[threadIdx.x]++;

}

It's quite a big project, so I doubt I could send just a part of it, but I would gladly send you the whole project in a private message. Can I somehow do that?

While debugging, I noticed something strange. If I give a value to some specific variables at the beginning or at the end of the kernel, they get that value.

But if I give a value to the variables somewhere in the middle of the kernel, it looks like no value is assigned at all, and they keep their starting value.

This is the kernel:

__global__ void

kernel(float Xr, float Xt, float Yr, float Yt,  float Zs, float a_xy, float b_xy, float a_xz, float b_xz, float d, float lambda,  float cos_alfa, int percent_, int resolution, float sin_beta, float cos_beta, float a_elipse, float b_elipse, float x_elipse, float y_elipse,unsigned char** d_grid, float grid_resolution, float scaleC, float scaleR, float* max_h, float* max_v, bool *isFresnel, bool* isLOS)

{	

//this is OK

	max_v[threadIdx.x]=threadIdx.x;

	max_v[0]=10;

//SOME DECLARATIONS

if ((d1==0) || (d1>=d))

	{	

		isLOS[threadIdx.x]=LOS;

		isFresnel[threadIdx.x]=Fresnel;

		return;

	}

	else

	{

		float R1_fz = sqrt(lambda*d1*(d-d1)/d);

		float Zter;

		if(d_grid == 0)

			Zter = 0.0;

		else

		{

			float q = Xp;

			float w = Yp;

			q*=scaleC;

			w*=scaleR;

			int a =  (q/grid_resolution);

			int b =  (w/grid_resolution);

			Zter = d_grid[a][b];

		}

					

		float h_ter=Zter-Zp;

		float Rxy=R1_fz/cos_alfa;

		float R=Rxy*percent_/100;

			

		if(h_ter >= 0)

		{

			LOS=Fresnel=false;

			float v=h_ter*sqrt_two/Rxy;

			if(fabs(max_h[threadIdx.x]-h_ter)<0.00001)

			{

				//THIS IS THE PROBLEM

								//if these two lines are not commented out, the variables don't get new values

				max_h[threadIdx.x]=h_ter;

					max_v[7]=-10;//v;

				int dummy;

			}

			else

			{

				if (h_ter > max_h[threadIdx.x])

				{

						//THIS IS THE PROBLEM

										//if these two lines are not commented out, the variables don't get new values

					max_h[threadIdx.x]=h_ter;

					  max_v[threadIdx.x]=-10;//v;

					int dummy;

				}

			}

		}

}

}

//this is OK

	isLOS[threadIdx.x]=0;

	max_h[threadIdx.x]=2;

	max_v[threadIdx.x]=4;

	isFresnel[threadIdx.x]=1;

return;				

}

Don’t know if I got the braces right because it is a cut-out, but I hope it demonstrates what I am trying to explain.

The variables are arrays and I use them to get the values out of the kernel.

The beginning and end assignments of values are just for testing; what I actually need is the middle assignment.

And sorry for not formatting the code before :)

Thanks.

Hello guys, I still haven't fixed the problem, but I'm working hard on it.

Here is what I found out, and I don't know why I haven't checked it before.

When in DEBUG mode, the Output window says:

'cppIntegration.exe': Loaded 'C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Debug\cppIntegration.exe', Symbols loaded.

'cppIntegration.exe': Loaded 'C:\Windows\System32\ntdll.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\kernel32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\KernelBase.dll'

'cppIntegration.exe': Loaded 'C:\CUDA\bin\cudart32_30_8.dll', Binary was not built with debug information.

'cppIntegration.exe': Loaded 'C:\Windows\System32\nvcuda.dll', Binary was not built with debug information.

'cppIntegration.exe': Loaded 'C:\Windows\System32\user32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\gdi32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\lpk.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\usp10.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\msvcrt.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\advapi32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\sechost.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\rpcrt4.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\imm32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\msctf.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\nvapi.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\ole32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\oleaut32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\shlwapi.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\shell32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\setupapi.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\cfgmgr32.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\devobj.dll'

'cppIntegration.exe': Loaded 'C:\Windows\System32\version.dll'

First-chance exception at 0x75ae9617 in cppIntegration.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012f728. (repeated five times)

First-chance exception at 0x75ae9617 in cppIntegration.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012f758. (repeated five times)

'cppIntegration.exe': Loaded 'C:\Windows\System32\apphelp.dll'

The thread 'Win32 Thread' (0xef4) has exited with code -1073741510 (0xc000013a).

The thread 'Win32 Thread' (0xcac) has exited with code -1073741510 (0xc000013a).

The program '[504] cppIntegration.exe: Native' has exited with code -1073741510 (0xc000013a).

Can someone make sense of this?

I read somewhere that this exception might be caused by using more shared memory than is available, so I printed this:

SHARED MEMORY SIZE - MAX_V:160  MAX_H:160  LOS:40  FRESNEL:40  GRID:2500

NUMBER OF THREADS: 40

MEMORY SIZE OF ARGUMENTS: 92

TOTAL = SHARED(2900) + ARGUMENTS(3680) = 6580

The shared memory lines are arrays that are passed to all the threads and are used for cooperation between them.

The size of the arguments is multiplied by the number of threads, and the sum is a total of 6580 B.

Isn't this below the shared memory limit?

Please help.

Thanks,

Vojdan.