Kernel doesn't return correct values but in emulation mode it does


I’m facing a problem with the following kernel:

__global__ void GALfilterKern(kernArgs *kArgs, Complex *deviceSig_filt, Complex *d_Y)
{
	int i,n,m;


	// ===== INPUT PARAMETERS =====

	const float delta = 1e-2f;   // small positive constant for "desired response"

	const float beta = 0.8f;

	const float mhu = 0.08f;


	// ===== INITIALIZATION =====

	float absE_f = 0.0f;

	float absE_b = 0.0f;



	for(n=0; n<kArgs->sig_length; ++n){

		// data in

		Complex u = {kArgs->sig[n].real,kArgs->sig[n].img};

		Complex d = {kArgs->sig_d[n].real,kArgs->sig_d[n].img};


		// forward and backward error initialization

		kArgs->E_f[0].real = u.real;

		kArgs->E_f[0].img = u.img;

		kArgs->E_b[1][0].real = u.real;

		kArgs->E_b[1][0].img = u.img;


		// desired response at time n and stage "-1"

		kArgs->y[0] = c_mul(c_con(kArgs->h[0]),kArgs->E_b[1][0]);

		kArgs->err[0] = c_sub(d,kArgs->y[0]);

		absE_b = c_abs(kArgs->E_b[1][0]);

		kArgs->norm_b[0] = delta + (absE_b*absE_b);

		Complex mn = {mhu/kArgs->norm_b[0],0.0f};

		Complex partMul = c_mul(kArgs->E_b[1][0],c_con(kArgs->err[0]));

		kArgs->h[0] = c_add(kArgs->h[0],c_mul(mn,partMul));


		for(m=1; m<M_DEFAULT+1; ++m){

			absE_f = c_abs(kArgs->E_f[m-1]);

			absE_b = c_abs(kArgs->E_b[0][m-1]);

			kArgs->Energy[m-1] = beta * kArgs->Energy[m-1] + (1-beta) * ((absE_f*absE_f) + (absE_b*absE_b));		

			kArgs->E_f[m] = c_add(kArgs->E_f[m-1],c_mul(c_con(kArgs->k[m-1]),kArgs->E_b[0][m-1]));

			kArgs->E_b[1][m] = c_add(kArgs->E_b[0][m-1],c_mul(kArgs->k[m-1],kArgs->E_f[m-1]));

			Complex mE = {mhu/kArgs->Energy[m-1],0.0f};

			Complex firstMul = c_mul(c_con(kArgs->E_f[m-1]),kArgs->E_b[1][m]);

			Complex secondMul = c_mul(kArgs->E_b[0][m-1],c_con(kArgs->E_f[m]));

			kArgs->k[m-1] = c_sub(kArgs->k[m-1],c_mul(c_add(firstMul,secondMul),mE));		

			// desired response

			kArgs->y[m] = c_add(kArgs->y[m-1],c_mul(c_con(kArgs->h[m]),kArgs->E_b[1][m]));

			kArgs->err[m] = c_sub(d,kArgs->y[m]);

			absE_b = c_abs(kArgs->E_b[1][m]);

			kArgs->norm_b[m] = kArgs->norm_b[m-1] + (absE_b*absE_b);

			Complex mn_b = {mhu/kArgs->norm_b[m],0.0f};

			kArgs->h[m] = c_add(kArgs->h[m],c_mul(mn_b,c_mul(kArgs->E_b[1][m],c_con(kArgs->err[m]))));


		}

		// copy the "new" backward-error row into the "old" row for the next sample
		for(i=0; i<M_DEFAULT+1; ++i)
			kArgs->E_b[0][i] = kArgs->E_b[1][i];

		// data out
		d_Y[n] = kArgs->y[m-1];
		deviceSig_filt[n] = kArgs->err[m-1];
	}
}



For testing purposes, I want to launch the kernel with one thread only; so in main() I wrote:

GALfilterKern<<<1,1>>>(dArgs,deviceSig_filt,d_Y);  // deviceSig_filt contains results

The problem is that the results are wrong (and different at every launch), while in emulation mode they are correct. It's strange, because only one thread executes the kernel. What should I do?

You have some complex host setup code that you didn’t include, making it hard to analyze.

Likely it's in your input data. You're possibly sharing some host pointers, not device pointers.
This may be hidden inside your data's complexity, like the kernArgs structure… are you sure it's packed with DEVICE pointers, with proper memcpy initialization? This is always tricky to do, especially since you have these indirections and multi-dimensional arrays.

In emulation mode, host pointers are (unfortunately) usable by the device since they share the same memory space. This is one of the two common causes for “the emulator works, but the GPU doesn’t!” (The other is thread races, but you’re testing with a single thread so that’s not a problem.)

If you run this through Ocelot, or even the toolkit's cuda-memcheck tool, it may immediately show you a memory pointer error without any detailed human analysis needed.

Thanks for your answer.

If I launch cuda-memcheck it gives me this output:

[emanuele@diablo GAL_CUDA_ONETHD]$ cuda-memcheck ./gal


Elapsed time: 586.000000 microsec.

========= Internal Error

The kernArgs structure is the following:

typedef struct {

	const Complex *sig;

	const Complex *sig_d;

	float *Energy;	// total energy of prediction error at the input of the m-th stage

	Complex *h;	// filter taps

	float *norm_b;

	Complex *E_f;	// forward error

	Complex **E_b;	// backward error, old 1st row, new 2nd row

	Complex *k;	// reflection coefficient

	Complex *y;

	Complex *err;

	int sig_length;

} kernArgs;

This struct is allocated by the following function (in the host code):

kernArgs *allocHostArgs()
{
	int i,sig_length;
	const float a = 1e-2f;   // small positive constant

	kernArgs *kArgs = (kernArgs *) malloc (sizeof(kernArgs));
	if(kArgs == NULL)
		PrintError("Error malloc() kArgs:");

	kArgs->sig = load_input("sig.txt",&sig_length);
	kArgs->sig_d = load_input("sig_d.txt",&sig_length);
	kArgs->sig_length = sig_length;

	kArgs->Energy = (float *) malloc (M_DEFAULT*sizeof(float));
	if(kArgs->Energy == NULL)
		PrintError("Error malloc() kArgs->Energy:");
	for(i=0; i<M_DEFAULT; ++i)
		kArgs->Energy[i] = a;

	kArgs->h = (Complex *) calloc (M_DEFAULT+1,sizeof(Complex));
	if(kArgs->h == NULL)
		PrintError("Error calloc() kArgs->h:");

	kArgs->norm_b = (float *) calloc (M_DEFAULT+1,sizeof(float));
	if(kArgs->norm_b == NULL)
		PrintError("Error calloc() kArgs->norm_b:");

	kArgs->E_f = (Complex *) malloc (M_DEFAULT*sizeof(Complex));
	if(kArgs->E_f == NULL)
		PrintError("Error malloc() kArgs->E_f:");

	kArgs->E_b = (Complex **) malloc (2*sizeof(Complex *));
	if(kArgs->E_b == NULL)
		PrintError("Error malloc() kArgs->E_b:");

	kArgs->E_b[0] = (Complex *) calloc (M_DEFAULT+1,sizeof(Complex));
	if(kArgs->E_b[0] == NULL)
		PrintError("Error calloc() kArgs->E_b[0]:");

	kArgs->E_b[1] = (Complex *) calloc (M_DEFAULT+1,sizeof(Complex));
	if(kArgs->E_b[1] == NULL)
		PrintError("Error calloc() kArgs->E_b[1]:");

	kArgs->k = (Complex *) malloc (M_DEFAULT*sizeof(Complex));
	if(kArgs->k == NULL)
		PrintError("Error malloc() kArgs->k:");

	kArgs->y = (Complex *) malloc ((M_DEFAULT+1)*sizeof(Complex));
	if(kArgs->y == NULL)
		PrintError("Error malloc() kArgs->y:");

	kArgs->err = (Complex *) malloc ((M_DEFAULT+1)*sizeof(Complex));
	if(kArgs->err == NULL)
		PrintError("Error malloc() kArgs->err:");

	return kArgs;
}

So in main(), I do:


kernArgs *dArgs,*hArgs;
hArgs = allocHostArgs();

cudaMalloc((void **) &deviceSig_filt, hArgs->sig_length*sizeof(Complex));
cudaMalloc((void **) &d_Y, hArgs->sig_length*sizeof(Complex));
cudaMalloc((void **) &dArgs, sizeof(kernArgs));
cudaMemcpy(dArgs,hArgs,sizeof(kernArgs),cudaMemcpyHostToDevice);
I don't know what's wrong.

You’re stuffing a dozen host pointers into a structure and sending that structure to the GPU. This is not useful or correct.

You need to allocate DEVICE memory, using cudaMalloc, and pass those device pointers. The GPU can do absolutely nothing with a host pointer.

It will also be a lot clearer to you mentally if you throw out the top-level kernArgs structure and just pass all the pointers as individual arguments to the kernel.
That may seem unclean and inelegant, passing a dozen arguments, but it's much clearer and easier to understand. If you had done this, it would have been easy to see "oh, I'm evaluating E_b[17], but E_b is a host pointer!".

Passing the pointers as arguments will also be a lot more efficient on G80 and GT200 than your structure method, since you'll avoid all those top-level structure indirections. On Fermi, they'll be inefficient but cached, so it's not a big deal.
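A minimal host-side sketch of that approach (variable names are hypothetical; Complex and M_DEFAULT are assumed to be defined as in the original code, and only a few of the dozen buffers are shown):

```cuda
// Sketch: allocate DEVICE buffers with cudaMalloc and pass the raw device
// pointers straight to the kernel, with no kernArgs struct in between.
Complex *d_sig, *d_h, *d_E_f;
cudaMalloc((void **) &d_sig, sig_length * sizeof(Complex));
cudaMalloc((void **) &d_h,   (M_DEFAULT+1) * sizeof(Complex));
cudaMalloc((void **) &d_E_f, (M_DEFAULT+1) * sizeof(Complex));

// Copy the host input data into the device buffers.
cudaMemcpy(d_sig, hArgs->sig, sig_length * sizeof(Complex), cudaMemcpyHostToDevice);
cudaMemcpy(d_h,   hArgs->h,   (M_DEFAULT+1) * sizeof(Complex), cudaMemcpyHostToDevice);

// The kernel now receives device pointers directly as arguments.
GALfilterKern<<<1,1>>>(sig_length, d_sig, d_h, d_E_f, deviceSig_filt, d_Y);
```

Every pointer the kernel dereferences then comes from cudaMalloc, so there is no way to smuggle a host pointer onto the device by accident.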

Understood. But since I've made this call: cudaMemcpy(dArgs,hArgs,sizeof(kernArgs),cudaMemcpyHostToDevice); shouldn't it copy all the host memory to device memory (pointers included)? And if I replace the pointers with static arrays, would I have the same problem?

However, I can also pass the long argument list to the kernel; I made a struct only for elegance.

Do not depend on emulation; it has little to do with actual device execution.

It is not clear how you are getting results. Are you copying deviceSig_filt back to host memory to collect the results?

Yes, it does copy that structure to the device. But that just holds useless host pointers, not your data! sizeof(kernArgs) is probably 44 or 88 bytes in total, depending on 4- or 8-byte pointers.

If you had storage for static length arrays inside a structure, then it would probably work because that’s not a pointer, it’s real data.
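For example (a sketch; the FixedArgs type name is hypothetical), a struct holding fixed-size arrays carries the actual data, so a single cudaMemcpy of the struct moves everything:

```cuda
// Sketch: fixed-size arrays inside a struct are real data, not pointers,
// so copying the struct by value copies the arrays too.
typedef struct {
    Complex h[M_DEFAULT+1];      // filter taps stored by value
    float   Energy[M_DEFAULT];   // stage energies stored by value
    int     sig_length;
} FixedArgs;                     // hypothetical name

FixedArgs *dFixed;
cudaMalloc((void **) &dFixed, sizeof(FixedArgs));
// One copy moves the whole contents, arrays included.
cudaMemcpy(dFixed, hFixed, sizeof(FixedArgs), cudaMemcpyHostToDevice);
```

The trade-off is that the array sizes must be known at compile time, and large arrays make the struct itself large.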

If you are using a 64-bit OS (Linux), there is no size problem for the pointers.

Thanks a lot for your answers. I’ll try to pass a long list args to kernel.

Ok, I’ve changed my kernel by passing all args, instead of struct.

Substantially, the kernel now appears like this:

__global__ void GALfilterKern(int sig_length, Complex *sig_d, Complex *sig, float *Energy, float *norm_b, Complex *h, Complex *E_f, Complex *E_b, Complex *k, Complex *y, Complex *err, Complex *sig_filt, Complex *Y)

And I've allocated all the device memory in main(). E_b was a double pointer (as it is a 2×(M_DEFAULT+1) matrix), so I've "flattened" it and now it's a 2*(M_DEFAULT+1)-element array.
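The flattening can be made explicit with a small index helper (a sketch, assuming row-major layout with rows of length M_DEFAULT+1; the helper name is hypothetical):

```cuda
// Sketch: E_b flattened to a 2*(M_DEFAULT+1) array, row-major.
// Row 0 holds the "old" backward errors, row 1 the "new" ones,
// matching the original E_b[0][m] / E_b[1][m] convention.
__host__ __device__ int ebIndex(int row, int m)
{
    return row * (M_DEFAULT + 1) + m;
}

// Usage inside the kernel, replacing the old 2-D accesses:
//   E_b[ebIndex(1, m)] = c_add(E_b[ebIndex(0, m-1)], ...);
```

Keeping all accesses behind one helper makes it much harder to mix up the row stride in different places.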

Two effects:

  • The execution time has increased massively (from about 530 microseconds to about 1 min 20 s).

  • The results are still not correct (even though they are a little closer than before).

I'm comparing the results with a CPU version of this algorithm. I've also tried to compile with -prec-div=true and -prec-sqrt=true, but it's the same. :(

I have just looked at your code a bit more carefully. I have seen that your kArgs contains many pointers which are allocated on the host, and you have copied kArgs to dArgs; dArgs now contains pointers defined on the host, which are not valid on the device. It is a serious mistake. Before copying kArgs to dArgs you must allocate the inner pointers on the GPU. In addition, make sure of the size of the pointers; if you are using a 64-bit Linux OS, there is no problem.

Thanks, but I've changed my kernel and now it takes individual args instead of a struct. Now I have the problems described above.

This does not solve the problem of device pointers vs. host pointers. It just makes it a bit more obvious that you have to do something about it.

I've resolved that problem, because now I've allocated every device pointer with cudaMalloc, as SPWorley told me. In fact, the results are now still a little different from the originals, but they are the same at every kernel launch (before, the results were different at every launch, precisely because of that pointer problem).

Now I would like to know how to correct these results, and why it takes so long to execute.

Your speed issues are a separate problem. They come from using a single thread on a GPU (you will likely want to be running something like 256 threads on each of 1000 blocks), and from having poor memory access patterns. These are big design issues… the first will hurt you by over two orders of magnitude, the second by one order of magnitude.

But take it one step at a time. Just getting code actually running on the GPU is a satisfying subgoal.

That's right; in fact, at the moment speed is a secondary problem. However, it's strange, because before changing the code I had one thread anyway.

The results have priority and I don’t know why they are still incorrect. :(

Before the change in code, you were crashing your kernel because you were dereferencing host pointers on the device. You were just measuring how fast the kernel can crash, not how fast your computation was.

You’re not checking for any error return from your kernel launch… that would let you print an error when a kernel fails.

Ok, thanks. Understood. ;)

Any idea concerning the results?

Another question is:

If I wanted to allocate every field of the struct on the device, how could I do that? For example, if I have this struct:

typedef struct {

   int *integer;

   float *floating_point;

} structure;

How can I allocate the integer and floating_point fields on the device, since I can't dereference them in host code?

It’s multiple steps, prone to error, but not too hard. But there’s no reason to… the structure gives you no advantages whatsoever, and has speed penalties because of the extra dereferencing.

But the process is to create a host structure and populate its pointers using cudaMalloc, so you have a host structure filled with device pointers. Then you cudaMalloc a device structure, and cudaMemcpy the host structure's data into the device structure's memory. Then you pass the device pointer of the device struct to the kernel.

Lots of extra steps with no benefit. It can admittedly be useful when you have even more arguments (like 30+), but you’re not there yet.
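The steps described above might look like this, using the example `structure` type from earlier (a sketch; N, the data arrays, and the kernel name are hypothetical):

```cuda
// Sketch of the deep-copy process for a struct of device pointers.
structure hStruct;                      // struct lives on the host for now

// 1. Fill the host struct's fields with DEVICE pointers.
cudaMalloc((void **) &hStruct.integer,        N * sizeof(int));
cudaMalloc((void **) &hStruct.floating_point, N * sizeof(float));

// 2. Copy the actual data into those device buffers.
cudaMemcpy(hStruct.integer,        h_int_data,   N * sizeof(int),
           cudaMemcpyHostToDevice);
cudaMemcpy(hStruct.floating_point, h_float_data, N * sizeof(float),
           cudaMemcpyHostToDevice);

// 3. Allocate a device copy of the struct and copy the host struct
//    (now full of device pointers) into it.
structure *dStruct;
cudaMalloc((void **) &dStruct, sizeof(structure));
cudaMemcpy(dStruct, &hStruct, sizeof(structure), cudaMemcpyHostToDevice);

// 4. Pass the device pointer of the device struct to the kernel.
someKernel<<<grid, block>>>(dStruct);
```

Note that step 1 is legal because you only write the pointer *values* into a host-resident struct; you never dereference the device pointers on the host.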