values depending on shared memory size

Hi to all,

I am writing a small program to calculate this function:

f[q] = sum[ i from 0 to N] sum[ j from 0 to N ] { sin( q * rij ) / ( q * rij ) }

where q is a non-negative number, rij is the distance between atoms i and j, and N is the number of atoms.

f[q] has to be positive for every “q”.
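A host-side reference can help when checking the GPU output against this formula. The following is only a minimal sketch under stated assumptions: `Atom` and `debyeSum` are hypothetical names that do not appear in the program, the sum is evaluated directly in double precision, and every zero-distance pair (including i = j) is counted as 1, which is the limit of sin(x)/x for x -> 0.

```cpp
#include <cmath>
#include <vector>

// Hypothetical host-side atom type (the kernel itself uses float4).
struct Atom { double x, y, z; };

// Direct O(N^2) evaluation of f(q) = sum_i sum_j sin(q * r_ij) / (q * r_ij).
double debyeSum(const std::vector<Atom>& atoms, double q) {
    double f = 0.0;
    for (const Atom& a : atoms) {
        for (const Atom& b : atoms) {
            const double dx = a.x - b.x;
            const double dy = a.y - b.y;
            const double dz = a.z - b.z;
            const double x = q * std::sqrt(dx * dx + dy * dy + dz * dz);
            // sin(x)/x tends to 1 as x -> 0 (same atom, or q == 0).
            f += (x == 0.0) ? 1.0 : std::sin(x) / x;
        }
    }
    return f;
}
```

For two atoms at distance d this gives 2 + 2 * sin(q * d) / (q * d), and for q = 0 it gives N * N, so it can serve as a quick sanity check on small inputs.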

But something goes wrong!

This is my kernel:

[codebox]{
	int Idx = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	float4 bi = a_d[Idx];
	extern __shared__ float4 SharedPos[];
	float inte = 0.0f;

	#pragma unroll 8
	for (int j = 0; j < gridDim.x; j ++)
	{
		// load body positions into shared memory
		SharedPos[threadIdx.x] = a_d[blockDim.x * j + threadIdx.x];
		// wait until all the positions are loaded
		__syncthreads ();

		float4 bj;
		bj = SharedPos[j];

		if ((bi.w > 0.0f) && (bj.w > 0.0f))
		{
			for (int i = 0; i < blockDim.x; i++)
			{
				float3 r = {bi.x - bj.x, bi.y - bj.y, bi.z - bj.z};
				float distSq = r.x * r.x + r.y * r.y + r.z * r.z;
				if (distSq != 0) {
					float dist = sqrt(distSq);
					float step = q * Qstep * dist;
					inte += sin(step) / step;
					__syncthreads ();
				} else {
					inte ++;
				}

				// let threads write an array in order to keep all results
				int Idx = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
				monoInt[Idx] = inte;
			}

			// wait till threads have finished calculating before writing new position
			__syncthreads ();
		}
	}
}[/codebox]
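One detail in the kernel above may be worth checking: the inner loop variable `i` is never used, and `bj` is read once per tile as `SharedPos[j]`, where `j` is the tile index, so each staged tile seems to contribute the same pair `blockDim.x` times. For comparison, here is a CPU emulation of the tiling pattern the kernel appears to aim for. It is a sketch under assumptions, not the kernel itself: `Pos`, `sincSafe` and `tiledSum` are hypothetical names, the `w` validity flag is carried but not tested, and the number of positions is assumed to be a multiple of the tile size.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host stand-in for CUDA's float4 (w is unused in this sketch).
struct Pos { float x, y, z, w; };

// sin(x)/x with the x == 0 limit handled explicitly.
static float sincSafe(float x) { return x == 0.0f ? 1.0f : std::sin(x) / x; }

// Emulates a tiled pair sum on the CPU. Positions are staged tile by tile into
// a buffer of tileSize elements (the role SharedPos plays in the kernel), and
// every emulated thread t then visits each element k of the staged tile.
// Assumes pos.size() is a multiple of tileSize.
std::vector<float> tiledSum(const std::vector<Pos>& pos, std::size_t tileSize, float q) {
    std::vector<float> perThread(pos.size(), 0.0f);
    std::vector<Pos> tile(tileSize);
    for (std::size_t j = 0; j < pos.size() / tileSize; ++j) {
        // Stage tile j (on the GPU, one element per thread, then a barrier).
        for (std::size_t k = 0; k < tileSize; ++k) tile[k] = pos[j * tileSize + k];
        for (std::size_t t = 0; t < pos.size(); ++t) {
            for (std::size_t k = 0; k < tileSize; ++k) {  // index the tile with k, not j
                const float dx = pos[t].x - tile[k].x;
                const float dy = pos[t].y - tile[k].y;
                const float dz = pos[t].z - tile[k].z;
                perThread[t] += sincSafe(q * std::sqrt(dx * dx + dy * dy + dz * dz));
            }
        }
    }
    return perThread;
}
```

Summing the returned per-thread values corresponds to the host-side reduction over “monoInt”. In the emulation, each thread ends up with exactly one term per atom.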

I store each thread's result in the “monoInt” array and then reduce it on the host (so there is one “monoInt” array for every q), because so far I have not managed to do the reduction on the device…

In this way I obtain f[q], for a given “q”, as:

[codebox]for (int k = 0; k < N; k ++){
	f[q] += Hm[k];
}
[/codebox]

The strange (to me) thing is that the value changes with the “ShMem” size when I launch the kernel in this way:

[codebox]dim3 dimGrid(TilesNumber, 1);		// number of blocks in grid
dim3 dimBlock(ThNumber, 1, 1);		// number of threads in each block
unsigned int ShMem = 16320;
sinc <<< dimGrid, dimBlock, ShMem >>> (q, (float4*) a_d, N, intensity, monoInt, TileModule, step);[/codebox]

Any ideas? Thanks to anyone who can offer a suggestion!

The values change because sinc<<<>>> is possibly returning an error. The size of shared memory is 16384 bytes, but you can’t allocate all of it because part of it is used for passing parameters and other values. Check for errors after this call to verify this. (An easy way to do this is to call cudaThreadSynchronize() after your kernel and check its return value.)

[quote name=‘CaLu@debian’ date=‘Jun 11 2009, 03:49 PM’ post=‘551350’]

[codebox]{
		if (distSq != 0) {
			float dist = sqrt(distSq);
			float step = q * Qstep * dist;
			inte += sin(step) / step;
			__syncthreads ();
		} else {
			inte ++;
		}
}[/codebox]
[/quote]

Please follow what seibert wrote.

Also, are you running this in emulation mode? As far as I remember, that __syncthreads() inside an if should have deadlocked your application. You can’t do it inside an if statement…

eyal

Hi seibert and eyalhir74, thank you for your replies.

No, I am not running my program in emulation mode.

As you suggested, I tried:

[codebox]sinc <<< dimGrid, dimBlock, ShMem >>> (q, (float4*) a_d, N, intensity, monoInt, TileModule, step);
int cts;
cts = cudaThreadSynchronize();
printf("cts %d", cts);[/codebox]

The result is “cts 0”.

But I don’t know what that means… Is everything fine?

I have also tried letting nvcc decide the ShMem size (<<< dimGrid, dimBlock >>> ), but negative values still appear for some “q” values.

Could the problem be in this kernel operation?

[codebox]inte += sin(step) / step;[/codebox]

Maybe too many computations?

Thanks to both of you!!

I also moved __syncthreads() outside of the “if” statement, but nothing changed.

I forgot to mention…

For very small values of q, the results are correct: for q -> 0, f[q] has to equal N * N, and it does… The strange behaviour appears only for some points…

SOLVED (I hope)… ^_^

It seems the problem was due to the wrong shared memory size being allocated…
Allocating the quantity:

blockDimension * sizeof(float4),

it seems to work…
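The size computation above can be sketched on the host side as follows. This is an illustration only: `Float4` is a hypothetical stand-in with the same 16-byte layout as CUDA's float4, and `tileBytes` is a hypothetical helper name, not something from the program.

```cpp
#include <cstddef>

// Hypothetical stand-in for CUDA's float4: four floats, 16 bytes.
struct Float4 { float x, y, z, w; };

// Bytes of dynamic shared memory needed for one float4 slot per thread in a block.
std::size_t tileBytes(std::size_t threadsPerBlock) {
    return threadsPerBlock * sizeof(Float4);
}
```

The result would be passed as the third launch parameter in place of the fixed 16320, for example tileBytes(ThNumber), so the “SharedPos” array holds exactly one entry per thread of the block.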

Thanks to all!