Wrong shared memory allocation?

Hi,

I am writing a kernel that evaluates the interaction between all bodies (as in Mark Harris’ “Fast nBody Simulation”) and does something for every point calculated this way. For every interaction, the kernel sums N*N values for every parameter. To avoid N*N local memory writes for every value of the parameter (it can take up to a few thousand values), I have allocated an array in shared memory, like this:

[codebox]

extern __shared__ float4 sharedMemory[];

float4 *sharedPosition = sharedMemory; // size: number of threads times sizeof( float4 )

float *kernelIntensity = ( float * ) ( sharedMemory ) + sharedMemoryIdx;

with sharedMemory size:

int ShMem = 2 * threadsNumber * sizeof( float4 );

and:

unsigned int sharedMemoryIdx = 4 * thNumber;

[/codebox]

where sharedPosition is an array used to store the atomic positions (as in “Fast nBody Simulation”) and kernelIntensity is an array in which I sum N*N values at each position. I mean:

[codebox]

for i…

for j…

(first two loops for calculating body interaction)

for ( int q = qLim.x; q < qLim.y; q ++ )

 {

        kernelIntensity[ q + sharedMemoryIdx ] += f( qrij );

 }

[/codebox]

For space reasons, I then have to partially reduce kernelIntensity on the device before copying it to the host with memcpy (outside the i and j loops), so I need the kernelIntensity array.

Now, my problem is that, maybe because of the float cast ( ? ) or a wrong access index in the calculation, I cannot find a value for sharedMemoryIdx that gives me the right results…

Any ideas?

Please note that if kernelIntensity is declared in local memory, the results are right.

Thanks to anyone who can give me a suggestion,

cheers,

luca

News:

The results for some q points seem to be roughly right, but with a lot of “noise” in the output. Maybe I am overwriting the same data locations?

I have tried adding two __syncthreads() calls, one before the write and one after, but nothing seems to change:

[codebox]

for ( int q = qLim.x; q < qLim.y; q ++ )
{
	qrij = ( q + qFirstPoint ) * stepDistance;
	__syncthreads();
	kernelIntensity[ q + sharedMemoryIdx ] += f( qrij );
	__syncthreads();
}

[/codebox]

Another strange thing is that with fast math the results change dramatically. Maybe I am using too many resources (I am currently working with a GTX 260, cc 1.3)?

SOLVED ( I hope … )

Wrong indexing! :-D
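For anyone who finds this thread later: the post does not say which index was wrong, so the following is only a minimal sketch of one indexing scheme that stays inside the allocation, not necessarily the fix that was applied. It assumes that thNumber equals the block size, that the offset is applied once (when the float pointer is formed) and not again when indexing, and that each thread accumulates only into its own q slots. That last point matters because __syncthreads() orders the threads but does not make a concurrent += on the same shared location safe. The helper intensityOffset, the kernel name accumulate, and the body of f are hypothetical stand-ins.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Offset of the intensity region, counted in floats: the first `threads`
// float4 entries (4 floats each) are reserved for the positions.
__host__ __device__ inline unsigned int intensityOffset(unsigned int threads)
{
    return 4u * threads;
}

// Hypothetical stand-in for f( qrij ) from the post.
__device__ inline float f(unsigned int q, float4 p)
{
    return (q + 1u) * p.x;
}

__global__ void accumulate(const float4 *pos, float *out, int nBodies)
{
    extern __shared__ float4 sharedMemory[];
    float4 *sharedPosition = sharedMemory;
    // The offset is applied exactly once, here; afterwards index with q only.
    float *kernelIntensity =
        reinterpret_cast<float *>(sharedMemory) + intensityOffset(blockDim.x);

    const unsigned int t  = threadIdx.x;
    const unsigned int nQ = 4u * blockDim.x;   // floats in the second half

    // Shared memory is not zero-initialised; each thread clears its own slots.
    for (unsigned int q = t; q < nQ; q += blockDim.x)
        kernelIntensity[q] = 0.0f;

    for (int tile = 0; tile < nBodies; tile += blockDim.x) {
        const int j = tile + (int)t;
        if (j < nBodies) sharedPosition[t] = pos[j];
        __syncthreads();                       // tile fully loaded

        const int remaining = nBodies - tile;
        const int count = remaining < (int)blockDim.x ? remaining : (int)blockDim.x;
        // Assumption: thread t owns q = t, t + blockDim.x, ... so no two
        // threads ever update the same element.
        for (unsigned int q = t; q < nQ; q += blockDim.x)
            for (int k = 0; k < count; ++k)
                kernelIntensity[q] += f(q, sharedPosition[k]);
        __syncthreads();                       // done before the tile is overwritten
    }

    for (unsigned int q = t; q < nQ; q += blockDim.x)
        out[q] = kernelIntensity[q];
}

int main()
{
    const int threads = 32, nBodies = 100;
    const unsigned int nQ = 4u * threads;
    std::vector<float4> hPos(nBodies, make_float4(1.0f, 0.0f, 0.0f, 0.0f));
    std::vector<float> hOut(nQ, -1.0f);

    float4 *dPos = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dPos, nBodies * sizeof(float4));
    cudaMalloc(&dOut, nQ * sizeof(float));
    cudaMemcpy(dPos, hPos.data(), nBodies * sizeof(float4), cudaMemcpyHostToDevice);

    // Same size as in the post: positions plus an equally large intensity region.
    const size_t shMem = 2 * threads * sizeof(float4);
    accumulate<<<1, threads, shMem>>>(dPos, dOut, nBodies);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, nQ * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    // With every pos.x equal to 1, slot q should hold nBodies * (q + 1).
    int mismatches = 0;
    for (unsigned int q = 0; q < nQ; ++q)
        if (hOut[q] != nBodies * (q + 1.0f)) ++mismatches;
    printf("status: %s, mismatches: %d\n", cudaGetErrorString(err), mismatches);

    cudaFree(dPos);
    cudaFree(dOut);
    return (err == cudaSuccess && mismatches == 0) ? 0 : 1;
}
```

If the code in the first post is taken literally, kernelIntensity already points 4 * thNumber floats into the buffer, so indexing it with q + sharedMemoryIdx starts at float 8 * thNumber, which is exactly the end of the 2 * threadsNumber * sizeof( float4 ) allocation. That would be one way to get noisy results like the ones described, although other causes cannot be ruled out from the post alone.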

bye!