Kernel doesn't return correct values but in emulation mode it does

tera · July 25, 2010, 8:55am

Another question is:

If I wanted to allocate every field of the struct on the device, how can I do that? For example, if I have this struct:
typedef struct {
   int *integer;
   float *floating_point;
} structure;
How can I allocate the integer and floating_point fields on the device, since I can’t dereference them in host code?

Just use cudaMalloc((void **)&(structure.integer), N*sizeof(int)) as normal. cudaMalloc() does not dereference the device pointer; it only stores the new device address in the host-side field whose address you pass, and it can only be called from host code anyway.

cudamast1973 · July 25, 2010, 11:14am

Hi!

I’m facing a problem with the following kernel:

__global__ void GALfilterKern(kernArgs *kArgs, Complex *deviceSig_filt, Complex *d_Y)
{
	int i,n,m;

	// ===== INPUT PARAMETERS ===
	const float delta = 1e-2f;   // small positive constant for "desired response"
	const float beta = 0.8f;
	const float mhu = 0.08f;

	// ===== INITIALIZATION =====
	float absE_f = 0.0f;
	float absE_b = 0.0f;

	// ===== APPLICATION TO INPUT SIGNAL FOR EACH SAMPLE===
	for(n=0; n<kArgs->sig_length; ++n){
		// data in
		Complex u = {kArgs->sig[n].real,kArgs->sig[n].img};
		Complex d = {kArgs->sig_d[n].real,kArgs->sig_d[n].img};

		// forward and backward error initialization
		kArgs->E_f[0].real = u.real;
		kArgs->E_f[0].img = u.img;
		kArgs->E_b[1][0].real = u.real;
		kArgs->E_b[1][0].img = u.img;

		// desired response at time n and stage "-1"
		kArgs->y[0] = c_mul(c_con(kArgs->h[0]),kArgs->E_b[1][0]);
		kArgs->err[0] = c_sub(d,kArgs->y[0]);
		absE_b = c_abs(kArgs->E_b[1][0]);
		kArgs->norm_b[0] = delta + (absE_b*absE_b);
		Complex mn = {mhu/kArgs->norm_b[0],0.0f};
		Complex partMul = c_mul(kArgs->E_b[1][0],c_con(kArgs->err[0]));
		kArgs->h[0] = c_add(kArgs->h[0],c_mul(mn,partMul));

		for(m=1; m<M_DEFAULT+1; ++m){
			absE_f = c_abs(kArgs->E_f[m-1]);
			absE_b = c_abs(kArgs->E_b[0][m-1]);
			kArgs->Energy[m-1] = beta * kArgs->Energy[m-1] + (1-beta) * ((absE_f*absE_f) + (absE_b*absE_b));
			kArgs->E_f[m] = c_add(kArgs->E_f[m-1],c_mul(c_con(kArgs->k[m-1]),kArgs->E_b[0][m-1]));
			kArgs->E_b[1][m] = c_add(kArgs->E_b[0][m-1],c_mul(kArgs->k[m-1],kArgs->E_f[m-1]));
			Complex mE = {mhu/kArgs->Energy[m-1],0.0f};
			Complex firstMul = c_mul(c_con(kArgs->E_f[m-1]),kArgs->E_b[1][m]);
			Complex secondMul = c_mul(kArgs->E_b[0][m-1],c_con(kArgs->E_f[m]));
			kArgs->k[m-1] = c_sub(kArgs->k[m-1],c_mul(c_add(firstMul,secondMul),mE));
			// desired response
			kArgs->y[m] = c_add(kArgs->y[m-1],c_mul(c_con(kArgs->h[m]),kArgs->E_b[1][m]));
			kArgs->err[m] = c_sub(d,kArgs->y[m]);
			absE_b = c_abs(kArgs->E_b[1][m]);
			kArgs->norm_b[m] = kArgs->norm_b[m-1] + (absE_b*absE_b);
			Complex mn_b = {mhu/kArgs->norm_b[m],0.0f};
			kArgs->h[m] = c_add(kArgs->h[m],c_mul(mn_b,c_mul(kArgs->E_b[1][m],c_con(kArgs->err[m]))));
		}

		for(i=0; i<M_DEFAULT+1; ++i)
			kArgs->E_b[0][i]=kArgs->E_b[1][i];

		d_Y[n] = kArgs->y[m-1];
		deviceSig_filt[n] = kArgs->err[m-1];
	}
}

For testing purposes, I want to launch the kernel with a single thread, so in main() I wrote:

GALfilterKern<<<1,1>>>(dArgs,deviceSig_filt,d_Y);  // deviceSig_filt contains results

The problem is that the results are wrong (and different on every launch), while in emulation mode they are correct. This is strange, because only a single thread executes the kernel. What should I do?

A question for you: are you debugging in emulation mode using SDK 2.3?

Manugal · July 25, 2010, 5:38pm

Thanks to all for the responses. :)

I’m debugging with SDK 3.0 on a Linux x64 machine.

Now the situation is the following:

[Before changing the code]:

Emulation Mode: Correct
Normal Mode: Incorrect

[After the code change suggested by SPWorley]:

Emulation Mode: Incorrect
Normal Mode: Incorrect

Emulation and normal mode now give the same results, and those results are closer to correct than before the code change. In particular, the output has about 240000 numbers and only the first three are correct; the rest are wrong.

SPWorley · July 25, 2010, 6:04pm

Since emulation is now incorrect as well, either your math logic accidentally changed or, more likely, your array indexing did, perhaps when you flattened your 2D arrays into 1D.

You can use valgrind on emulator runs, and Ocelot or cuda-memcheck on GPU code, to help find memory access errors.
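
To make the flattening concrete, here is a minimal host-side sketch in plain C of row-major indexing for a two-row array like E_b. Everything in it is an assumption for illustration: the value of M_DEFAULT, the Complex layout (only the real and img field names come from the kernel above), and the helper names eb_index and eb_shift_rows, which do not appear in the original code.

```c
#include <assert.h>

#define M_DEFAULT 4              /* assumed value, for illustration only */
#define NROWS 2                  /* E_b[0] = previous sample, E_b[1] = current */
#define NCOLS (M_DEFAULT + 1)    /* stages 0..M_DEFAULT, as in the kernel loops */

/* Assumed layout; the kernel only shows the field names real and img. */
typedef struct {
	float real;
	float img;
} Complex;

/* Hypothetical helper: map E_b[row][col] to an offset in a flat,
   row-major array of NROWS * NCOLS elements. */
static int eb_index(int row, int col)
{
	assert(row >= 0 && row < NROWS);
	assert(col >= 0 && col < NCOLS);
	return row * NCOLS + col;    /* the stride is the row length, not NROWS */
}

/* Hypothetical helper: the flat equivalent of E_b[0][i] = E_b[1][i]. */
static void eb_shift_rows(Complex *E_b)
{
	int i;
	for (i = 0; i < NCOLS; ++i)
		E_b[eb_index(0, i)] = E_b[eb_index(1, i)];
}
```

With this layout, an access such as E_b[1][m] becomes E_b[eb_index(1, m)]. A common slip when flattening is to multiply the row by the wrong stride (for example M_DEFAULT instead of M_DEFAULT + 1), which makes the two rows overlap and would corrupt all but the first few outputs.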

Manugal · July 25, 2010, 7:37pm

Finally, I did it!!!

Thanks SPWorley, you were right. The indexing of E_b (the flattened array) was wrong. Now it works perfectly and gives me correct results.

Thanks to all. ;)
