Results in emu mode differ from results in non-emu mode

Hi to all,

i am new to programming with CUDA and i have a problem. i am trying to write a program following the “Nbody” SDK tutorial, but something goes wrong. when i launch this kernel:

[codebox]__global__ void sinc (float4 *a_d, unsigned long int N, float *intensity, float *inteNew)

{

int Idx = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;



float4 bi = a_d[Idx];



for (int q = 1; q < Q; q ++) {

	// compute intensity for this body

	float inte = intensity[q];

	inte += computeBodyIntensity(bi, a_d, N, inte, q);

	// store intensity

	inteNew[q] += inte;

}

}

__device__ float computeBodyIntensity(float4 bi, float4* a_d, unsigned long int N, float inte, int q)

{

inte = 0.0f;



for (int j = 0; j < N; j ++)

{

	inte = bbinteraction(inte, a_d[j], bi, q);

	__syncthreads ();

	

}



return inte;

}

__device__ float bbinteraction(float inten, float4 bj, float4 bi, int q)

{

float3 r = {bi.x - bj.x, bi.y - bj.y, bi.z - bj.z};

float distSq = r.x * r.x + r.y * r.y + r.z * r.z;

if (distSq != 0) {

	

	// N couples not calculated

	float dist = sqrt(distSq);

	float step = q * Qstep * dist;

	inten += sin(step) / step;

} else {

	

	inten ++;



}



return inten;

}[/codebox]

i launch the kernel in this way:

[codebox] dim3 dimGrid(N / ThNumber, 1); // number of blocks in grid

	dim3 dimBlock(ThNumber, 1, 1);		// number of threads in each block

	sinc <<< dimGrid, dimBlock >>> ((float4*) a_d, N, intensity, inteNew);[/codebox]
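As an aside, a grid of N / ThNumber blocks only covers every body when N is an exact multiple of ThNumber (true for the N = 4, ThNumber = 2 test below). A hedged sketch of a launch that also handles the general case:

```cuda
// Sketch: round the grid size up and guard the tail inside the kernel,
// in case N is not a multiple of ThNumber. With N = 4 and ThNumber = 2
// this changes nothing. Variable names follow the original post.
dim3 dimGrid((N + ThNumber - 1) / ThNumber, 1);   // ceiling division
dim3 dimBlock(ThNumber, 1, 1);
sinc <<< dimGrid, dimBlock >>> ((float4*) a_d, N, intensity, inteNew);

// ...and at the top of the kernel, skip the extra threads:
// if (Idx >= N) return;
```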

i have tested a file where N = 4 and ThNumber = 2; all fine in emu mode (for q = 1, Qstep = 0.05 -> inteNew[1] = 15.990002) but not in non-emu mode (inteNew[1] = 3.997500). :wacko:

maybe the blocks are not able to sum the intensity? i cannot understand what the problem could be… have i missed some __syncthreads()?

it would be great if anyone has any idea…

thank you all,

luca

Problem lies here:

inteNew[q] += inte;

In deviceemu mode you have one thread computing each task, one by one. On the device there are at least 32 threads executing at exactly the same time.

That addition translates to:

for every thread in parallel:

  • load inteNew[q] to local register R

  • increase R by inte

  • store R to inteNew[q]

The last store wins: only one thread's register value effectively ends up in inteNew[q], and the updates from all the other threads are lost.
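For completeness, one simple way to make that read-modify-write safe is an atomic update. Note this is a sketch, not the original code: float atomicAdd on global memory requires compute capability 2.0 or later, so it may not be available on older hardware, and it serializes the updates rather than summing them in parallel.

```cuda
// Sketch: replacing the racy "inteNew[q] += inte" with an atomic update.
// Assumes float atomicAdd is available (compute capability 2.0+) and that
// Q, computeBodyIntensity, etc. are defined as in the original post.
__global__ void sinc_atomic(float4 *a_d, unsigned long int N,
                            float *intensity, float *inteNew)
{
    int Idx = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    float4 bi = a_d[Idx];

    for (int q = 1; q < Q; q++) {
        float inte = intensity[q];
        inte += computeBodyIntensity(bi, a_d, N, inte, q);
        atomicAdd(&inteNew[q], inte);   // hardware serializes the updates, so none are lost
    }
}
```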

The correct way to do this is to implement a reduction algorithm to sum up the results from all threads. There is one provided for you in the SDK. If I recall correctly, it is unnecessarily complicated and not as fast as you can get, but it is a good place to start.
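The usual shape of such a reduction, sketched as a block-level helper (a simplified version of what the SDK sample does; it assumes the block size is a power of two and that ThNumber is a compile-time constant, as in this thread):

```cuda
// Sketch of a shared-memory tree reduction: each thread deposits its
// partial sum, then pairs of values are added while the stride halves.
// Assumes blockDim.x == ThNumber and ThNumber is a power of two.
__device__ float blockReduceSum(float val)
{
    __shared__ float sdata[ThNumber];
    unsigned int tid = threadIdx.x;

    sdata[tid] = val;
    __syncthreads();                       // all partial sums visible

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];  // fold the upper half onto the lower half
        __syncthreads();
    }
    return sdata[0];                       // block-wide sum
}
```

A second pass (or atomics) would then combine the per-block sums into the final result.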

I understand…

thank you very much PDan, i will try to implement it… you mean the “Parallel Reduction” sample from the SDK, right?

What about something like:

[codebox]__device__ float bbinteraction(float inten, float4 bj, float4 bi, int q, float *monoInt)

{

float3 r = {bi.x - bj.x, bi.y - bj.y, bi.z - bj.z};

float distSq = r.x * r.x + r.y * r.y + r.z * r.z;

if (distSq != 0) {

	

	// N couples not calculated

	float dist = sqrt(distSq);

	float step = q * Qstep * dist;

	inten += sin(step) / step;

} else {

	

	inten ++;



}

const unsigned int Idx = __mul24(blockIdx.x, ThNumber) + threadIdx.x;

monoInt[Idx] += inten;



return inten;

}[/codebox]

to store per-thread results?

i am sorry, it should be:

monoInt[Idx] = inten;

not:

monoInt[Idx] += inten;

:-)

if i do the reduction on the host it seems to work!
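A sketch of that host-side reduction, under the assumption that monoInt holds one partial sum per thread in device memory and that inteNew_h is the host copy of the result array (names here are illustrative; error checking omitted):

```cuda
// Sketch: copy the per-thread partial sums back and add them up on the CPU.
float *monoInt_h = (float*) malloc(N * sizeof(float));
cudaMemcpy(monoInt_h, monoInt, N * sizeof(float), cudaMemcpyDeviceToHost);

float total = 0.0f;
for (unsigned long int i = 0; i < N; i++)
    total += monoInt_h[i];                 // serial sum, no races possible

inteNew_h[q] += total;                     // accumulate into the host-side result
free(monoInt_h);
```

This is simple and correct; the cost is one extra device-to-host copy per kernel launch, which is fine for small N.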

thank you very much PDan!