Hi,
I am writing a kernel that evaluates the interaction between all pairs of bodies (as in Mark Harris’ “Fast nBody Simulation”) and does some work for every interaction calculated this way. For every value of a parameter, the kernel sums N*N values, one per interaction. To avoid N*N local-memory writes for every parameter value (there can be a few thousand of them), I have allocated an array in shared memory, like this:
[codebox]
extern __shared__ float4 sharedMemory[];
float4 *sharedPosition = sharedMemory; // size: numberOfthreads times size of (float4)
float *kernelIntensity = ( float * ) ( sharedMemory ) + sharedMemoryIdx;
// with sharedMemory size:
int ShMem = 2 * threadsNumber * sizeof( float4 );
// and:
unsigned int sharedMemoryIdx = 4 * thNumber;
[/codebox]
where sharedPosition is an array used to store the atomic positions (as in “Fast nBody Simulation”) and kernelIntensity is an array in which N*N values are summed into each position. I mean:
[codebox]
for i…
for j…
// (the first two loops compute the body interactions)
for ( int q = qLim.x; q < qLim.y; q ++ )
{
kernelIntensity[ q + sharedMemoryIdx ] += f( qrij );
}
[/codebox]
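To make the intended layout concrete, here is a minimal sketch of one block's shared memory split into a float4 region followed by a float region, using the same total size as above (2 * threads * sizeof(float4)). It is only an illustration under assumptions: the names layoutDemo, intensityOffset and runLayoutDemo are hypothetical and not taken from the original kernel, each thread is assumed to own four accumulator slots, and the physics is left out entirely. In this sketch the offset is applied once, when the float pointer is formed, and the array is then indexed from zero.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: offset, in floats, of the accumulator region.
// The region begins right after `threads` float4 positions (4 floats each).
__host__ __device__ inline unsigned int intensityOffset(unsigned int threads)
{
    return 4u * threads;
}

// Hypothetical kernel: demonstrates only the memory layout.
__global__ void layoutDemo(float *out)
{
    extern __shared__ float4 sharedMemory[];

    // First region: one float4 per thread.
    float4 *sharedPosition = sharedMemory;
    // Second region: floats, starting after the float4 region.
    float *kernelIntensity =
        reinterpret_cast<float *>(sharedMemory) + intensityOffset(blockDim.x);

    const unsigned int t = threadIdx.x;
    sharedPosition[t] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);

    // Assumption: each thread owns 4 slots. Indices start at 0 because the
    // offset is already part of the pointer.
    for (unsigned int k = 0; k < 4; ++k)
        kernelIntensity[4 * t + k] = 0.0f;
    for (unsigned int k = 0; k < 4; ++k)
        kernelIntensity[4 * t + k] += static_cast<float>(t);
    __syncthreads();

    // Copy out the accumulators plus one position component, so the host can
    // check that the two regions did not overwrite each other.
    for (unsigned int k = 0; k < 4; ++k)
        out[4 * t + k] = kernelIntensity[4 * t + k];
    out[4 * blockDim.x + t] = sharedPosition[t].w;
}

// Hypothetical host wrapper: launches one block and returns 5 * threads floats
// (4 * threads accumulators followed by threads position components).
// The buffer keeps its initial value of -1 if the device allocation fails.
std::vector<float> runLayoutDemo(unsigned int threads)
{
    const size_t shMem = 2 * threads * sizeof(float4); // same size as in the post
    const size_t n = 5 * static_cast<size_t>(threads);
    std::vector<float> host(n, -1.0f);

    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess)
        return host;
    layoutDemo<<<1, threads, shMem>>>(dev);
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling runLayoutDemo(64) from a host main and inspecting the returned vector should show whether the float region and the float4 region stay separate with this indexing.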
Because of space constraints, I then have to partially reduce kernelIntensity on the device (outside the i and j loops) before copying it to the host, which is why I need the kernelIntensity array.
Now, my problem is that, perhaps because of the float cast ( ? ) or a wrong access index in the calculation, I cannot find a value of sharedMemoryIdx that gives me the right results…
Any idea?
Please note that if kernelIntensity is declared in local memory, the results are correct.
Thanks to anyone who can offer a suggestion,
cheers,
luca