Strange problem with a __device__ function

Hi!

I’ve written a kernel that does some calculations. At a certain point, my kernel calls the following function:

__global__ void mykern(....)
{
	......
	float2 r[N3CELL2];	// N3CELL2 = 1536
	......
	xcorr(x,y,r);
	.......
}

__device__ void xcorr(float2 *x, float2 *y, float2 *r)
{
	int delay,i,j,k=0;
	float2 nf2,mx,my,sx,sy,sxy,denom,partSubx,partSuby;
	const int n=N3CELL;        // N3CELL = 768
	const int maxdelay=N3CELL;
	float2 newy[n];

	for(i=0; i<NCELL_DEF; ++i)  // NCELL_DEF = 256
		newy[i] = y[i];
	for(; i<N3CELL; ++i){
		newy[i].x = 0.0f;
		newy[i].y = 0.0f;
	}

	/* Calculate the mean of the two series x[], y[] */
	mx.x = 0.0f;
	mx.y = 0.0f;
	my.x = 0.0f;
	my.y = 0.0f;
	for (i=0; i<n; i++) {
		mx = c_add(mx,x[i]);
		my = c_add(my,newy[i]);
	}
	nf2.x = n;
	nf2.y = 0.0f;
	mx = c_div(mx,nf2);
	my = c_div(my,nf2);

	/* Calculate the denominator */
	sx.x = 0.0f;
	sx.y = 0.0f;
	sy.x = 0.0f;
	sy.y = 0.0f;
	for (i=0; i<n; i++) {
		partSubx = c_sub(x[i],mx);
		partSuby = c_sub(newy[i],my);
		sx = c_add(sx,c_mul(partSubx,partSubx));
		sy = c_add(sy,c_mul(partSuby,partSuby));
	}
	denom = c_sqrt(c_mul(sx,sy));

	/* Calculate the correlation series */
	for (delay=-maxdelay; delay<maxdelay; ++delay) {
		sxy.x = 0.0f;
		sxy.y = 0.0f;
		for (i=0; i<n; ++i) {
			j = i + delay;
			if (j < 0 || j >= n)
				continue;
			else
				sxy = c_add(sxy,c_mul(c_sub(x[i],mx),c_sub(newy[j],my)));
		}
		r[k] = c_div(sxy,denom);	/* r[k] is the correlation coefficient at "delay" */
		++k;
	}
}

The problem is that it returns an array of all NaN values. It’s strange, because the same function written in plain C in another program works very well. I’ve also tried launching my kernel with only one thread, but the situation doesn’t change. I don’t know why :(

Does xcorr() have a prototype before mykern()?

Yes, it does. It has a prototype in an include file (I haven’t posted it, for brevity).

I’ve partially resolved my problem. In xcorr() I’ve changed:

const int n=N3CELL;        // N3CELL = 768
const int maxdelay=N3CELL;

into:

const int n=N2CELL;        // N2CELL = 512
const int maxdelay=N2CELL;

I made this change because xcorr() must take two N-length vectors and return one 2N-length vector (not a 3N-length vector). Indeed, in the last loop I had an out-of-bounds indexing problem on the array x.
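
For reference, here is a minimal host-side C sketch of the lag bookkeeping I mean (real-valued and with a generic N, not my actual constants): sweeping the delay over [-N, N) produces exactly 2N output lags, and the j = i + delay bounds check keeps every access inside the N-length inputs.

	/* Real-valued sketch: cross-correlate two N-sample inputs over 2*N lags. */
	void xcorr_sketch(const float *x, const float *y, float *r, int N)
	{
		int k = 0;
		for (int delay = -N; delay < N; ++delay) {   /* 2*N lags in total */
			float acc = 0.0f;
			for (int i = 0; i < N; ++i) {
				int j = i + delay;
				if (j < 0 || j >= N)                 /* skip pairs that fall outside y[0..N-1] */
					continue;
				acc += x[i] * y[j];
			}
			r[k++] = acc;                            /* so r[] needs 2*N entries */
		}
	}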

Now I still get some NaNs, but apparently only in thread 32 (??). Here is a more complete version of my kernel:

// NCELL_DEF = 256
// N2CELL = 512
// N3CELL = 768
// N4CELL = 1024
// Nblock = 1014
// snap_shot is a matrix "flattened" to array and its length is Nblock*NCELL_DEF

__global__ void ambfunc_kern(float *hNc, float *hNblock, int Nblock, int block_size, cufftComplex *snap_shot)
{
	int i,j;
	float2 rx_snap[N2CELL];
	float2 ref_snap[NCELL_DEF];
	float2 app[N4CELL] = {0};
	float2 snap_tmp,hNcf2,hNblockf2;
	snap_tmp.x = 0.0f;
	snap_tmp.y = 0.0f;
	int tt = blockIdx.x * block_size + threadIdx.x;

	for(i=0; i<NCELL_DEF; ++i){
		ref_snap[i] = tex1Dfetch(texRefsig,tt*NCELL_DEF+i);
		rx_snap[i] = tex1Dfetch(texRefsigfilt,tt*NCELL_DEF+i);	// Select the reference signal samples for the current snapshot
		hNcf2.x = hNc[i];	// float2 value of hamming(Nc)
		hNcf2.y = 0.0f;
		ref_snap[i] = c_mul(ref_snap[i],hNcf2);	// Select the received signal samples for the current snapshot
		snap_tmp = c_add(snap_tmp,c_mul(ref_snap[i],ref_snap[i]));
	}

	for(; i<N2CELL; ++i)
		rx_snap[i] = tex1Dfetch(texRefsigfilt,tt*NCELL_DEF+i);

	snap_tmp = c_sqrt(snap_tmp);

	for(i=0; i<NCELL_DEF; ++i)
		ref_snap[i] = c_div(ref_snap[i],snap_tmp);

	xcorr(rx_snap,ref_snap,app);	// Correlation function

	for(i=0, j=N2CELL; i<NCELL_DEF && j<N3CELL; ++i, ++j){
		snap_shot[tt*NCELL_DEF+i].x = app[j].x;	// Selection of useful samples of autocorrelation function
		snap_shot[tt*NCELL_DEF+i].y = app[j].y;
	}

	for(i=0; i<Nblock; ++i){
		hNblockf2.x = hNblock[i];	// float2 value of hamming(Nblock)
		hNblockf2.y = 0.0f;
		snap_shot[NCELL_DEF*i+tt].x = snap_shot[NCELL_DEF*i+tt].x * hNblockf2.x - snap_shot[NCELL_DEF*i+tt].y * hNblockf2.y;
		snap_shot[NCELL_DEF*i+tt].y = snap_shot[NCELL_DEF*i+tt].x * hNblockf2.y + hNblockf2.x * snap_shot[NCELL_DEF*i+tt].y;
	}
}

With memcheck enabled in cuda-gdb, I get this message:

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching to CUDA Kernel 0 (<<<(1,0),(32,0,0)>>>)]
0x000000001fb3e7e8 in ambfunc_kern (hNc=0x5100000, hNblock=0x55fb000, 
	Nblock=1014, block_size=256, snap_shot=0x5600000) at main.cu:54
54			snap_shot[NCELL_DEF*i+tt].x = snap_shot[NCELL_DEF*i+tt].x * hNblockf2.x - snap_shot[NCELL_DEF*i+tt].y * hNblockf2.y;

The standalone cuda-memcheck tool reports the following:

========= CUDA-MEMCHECK
========= Invalid read of size 4
=========	 at 0x000050a8 in main.cu:2282:ambfunc_kern
=========	 by thread (32,0,0) in block (1,0)
========= Address 0x057fb104 is out of bounds
=========
========= ERROR SUMMARY: 1 error

I think this is the main problem and I don’t know why the indexing is out of bounds. Thanks in advance.

I’ve tracked it down. The problem derives from bad indexing of the arrays, but I don’t know how to index them correctly.

In the kernel I’ve written:

int tt = blockIdx.x * block_size + threadIdx.x;	// where block_size is passed as an argument

I need the maximum value of tt to be 1014, but I can’t figure out the correct launch configuration. In main() I did:

int block_size = atoi(argv[1]);
int numBlocks = Nblock / block_size + (Nblock%block_size==0 ? 0 : 1);	// where Nblock is exactly 1014

But with this count the total number of launched threads will never be exactly 1014 (because 1014 is not divisible by, for example, 256 or 512). What can I do?

Pad the input with extra values to get to a round multiple of your block size, or add a bounds check to the kernel so that only threads in the valid index range do the calculations and the others skip them or return early. Be aware of the implications for block- and warp-level synchronization primitives if you choose the second option.
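
For example, a minimal sketch of the second option, reusing the names from the code above (ambfunc_kern, Nblock, block_size); everything else is illustrative:

	// Host side: round the grid size up so numBlocks * block_size >= Nblock.
	int block_size = atoi(argv[1]);
	int numBlocks  = (Nblock + block_size - 1) / block_size;   // ceiling division
	ambfunc_kern<<<numBlocks, block_size>>>(hNc, hNblock, Nblock, block_size, snap_shot);

	// Device side: the extra threads of the last block simply do no work.
	__global__ void ambfunc_kern(float *hNc, float *hNblock, int Nblock,
	                             int block_size, cufftComplex *snap_shot)
	{
		int tt = blockIdx.x * block_size + threadIdx.x;
		if (tt >= Nblock)      // threads beyond the valid index range exit early
			return;
		// ... rest of the kernel, indexing with tt as before ...
	}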

Thanks a lot. I’ve opted for the second option. But what implications would it have in my kernel? Every thread indexes a single value in the array via the tt variable, right?

I was referring to the use of things like __syncthreads(), __all(), __any(), etc. Those can cause a kernel to hang if one or more threads in a warp exit early while other active threads execute one of them. I only skimmed your kernel and didn’t see anything like that, so you are probably OK, but it is something to keep in mind.
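
As a purely hypothetical illustration of the hazard (none of this is taken from your kernel): when a block-level barrier is present, the early exit must not come before it; compute a validity flag instead and let every thread reach the barrier.

	__global__ void example_kern(const float *in, float *out, int n)
	{
		__shared__ float tile[256];                 // assumes a 256-thread block
		int tt = blockIdx.x * blockDim.x + threadIdx.x;

		// Unsafe: if out-of-range threads return here, the __syncthreads()
		// below can leave the rest of the block waiting forever.
		//   if (tt >= n) return;

		bool valid = (tt < n);
		tile[threadIdx.x] = valid ? in[tt] : 0.0f;  // out-of-range threads load a dummy value
		__syncthreads();                            // every thread reaches the barrier

		if (valid)
			out[tt] = tile[threadIdx.x];            // only valid threads write results
	}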

Understood. Thanks a lot.
