Ok guys, back from hell.
I implemented the spinlock global sync on the card… As you already mentioned, this only works up to a certain number of blocks per multiprocessor, so I determined that number for my card. On the Quadro FX 4600 the queue length (blocks per multiprocessor) turned out to be 9, which I only found out after rebooting my system three times because it hung. Would be nice if this could be fixed in the next driver version. :) So now I can split up the computation so that at most 9x12 = 108 blocks are used (12 multiprocessors on the FX 4600).
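A minimal host-side sketch of that split, just to make the arithmetic concrete. The names LaunchPlan and planLaunches are made up for illustration, and the limits 9 and 12 are only the values measured on the FX 4600, so other cards will need their own numbers:

```cpp
// Hypothetical helper, not part of the CUDA API: describes how the work is
// split into passes that each stay within the resident-block limit.
struct LaunchPlan {
    int blocksPerLaunch;  // blocks used in each pass (never above the limit)
    int numLaunches;      // number of passes needed to cover all blocks
};

// Split totalBlocks of work into passes of at most
// blocksPerMultiprocessor * multiprocessors blocks each.
inline LaunchPlan planLaunches(int totalBlocks,
                               int blocksPerMultiprocessor,
                               int multiprocessors)
{
    // Upper bound on blocks that can be resident on the card at once,
    // e.g. 9 * 12 = 108 for the values measured above.
    const int maxResident = blocksPerMultiprocessor * multiprocessors;

    LaunchPlan plan;
    plan.blocksPerLaunch = totalBlocks < maxResident ? totalBlocks : maxResident;
    // Ceiling division: a partially filled last pass still counts as a pass.
    plan.numLaunches = (totalBlocks + maxResident - 1) / maxResident;
    return plan;
}
```

With the measured limits, planLaunches(250, 9, 12) gives 3 passes of at most 108 blocks each, so the spinlock never has to wait on a block that is not resident yet.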
Here is the unoptimized hacked code snippet for the spinlocking:
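__global__ void computeDistance(float *dimage1,
                                float *dimage2,
                                float *dsumvector,
                                int *dstate)
{
    [...]
    // Linear index of this block within the grid.
    int blockId = blockIdx.x + __mul24(blockIdx.y, gridDim.x);
    // Access the counters and partial sums through volatile pointers so the
    // spin loop below re-reads global memory instead of a cached register.
    volatile int *vstate = dstate;
    volatile float *vsum = dsumvector;

    // Increment the cycle counter so we can synchronize by doing spinlocking on
    // this vector by assuming all blocks have finished if the counter is in the
    // same state for all blocks. Only one thread per block does the increment;
    // otherwise all threads of the block race on the same element.
    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        __threadfence();  // make this block's earlier global writes visible first
        vstate[blockId] = vstate[blockId] + 1;
    }

    // Do the spinlock wait with the first block, first thread and wait until
    // all blocks have the same counter state.
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        if (blockIdx.x == 0 && blockIdx.y == 0) {
            sum = vsum[0];
            // Go through the vector and wait until two successive counters are
            // equal, then sum up the synchronized element.
            for (i = 1; i < gridDim.x * gridDim.y; ++i) {
                while (vstate[i - 1] != vstate[i]) {}
                sum += vsum[i];
            }
            dsumvector[__mul24(gridDim.x, gridDim.y)] = sum;
        }
    }
}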
I still don’t get your point. :( My pseudocode looks like this:
loadImagesToSharedMemory();
do {
    setParameter();
    applyParameter();
    distance = calculateDistance();
    // !!! Here I need the overall distance
} while (distance > threshold);
Because the images stay the same over the whole while loop and I access them more than once, it would be nice if I could keep them in shared memory. Now if I call the functions above from the host (as separate kernels), I lose the images in shared memory each time one of the kernels finishes, right!? How can I prevent this with your approach?