Euclidean distance and synchronization

Ok guys, back from hell.

@MisterAnderson42:

I implemented the spinlock global sync on the card. As you already mentioned, this only works up to a certain number of blocks per multiprocessor, so I determined that number for my card. The queue length (blocks per multiprocessor) on the Quadro FX 4600 finally tested out to be 9, after three reboots because my system hung during the tests. Would be nice if this could be fixed in the next driver version. :) So I can now split up the computation so that at most 9x12 blocks are used (the FX 4600 has 12 multiprocessors).

Here is the unoptimized hacked code snippet for the spinlocking:

__global__ void computeDistance(float *dimage1,
                                float *dimage2,
                                float *dsumvector,
                                volatile int *dstate)  // volatile so the spin-wait re-reads memory
{
  [...]

  // Make sure every thread of this block has finished its partial sum.
  __syncthreads();

  // Let one thread per block increment the cycle counter so we can
  // synchronize by spinlocking on this vector: all blocks have finished
  // when the counter is in the same state for all blocks.
  if(threadIdx.x == 0 && threadIdx.y == 0) {
    dstate[blockIdx.x + __mul24(blockIdx.y, gridDim.x)] += 1;

    // Do the spinlock wait with the first block and wait until all blocks
    // have the same counter state.
    if(blockIdx.x == 0 && blockIdx.y == 0) {
      float sum = dsumvector[0];

      // Go through the vector and wait until the two successive counters
      // are equal, then sum up the synchronized element.
      for(int i = 1; i < gridDim.x * gridDim.y; ++i) {
        while(dstate[i-1] != dstate[i]) {}
        sum += dsumvector[i];
      }
      dsumvector[__mul24(gridDim.x, gridDim.y)] = sum;
    }
  }
}
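(Side note, my own sketch rather than anything from this thread: the usual deadlock-free alternative is to end the kernel at the synchronization point and let a second kernel launch act as the global barrier, which works for any number of blocks. A minimal sketch, assuming each block of computeDistance has already written one partial sum into dsumvector[0..numBlocks-1] and that it is launched with a power-of-two block size of at most 256 threads:)

```cuda
// Hypothetical follow-up kernel, launched with a single block after
// computeDistance has finished; the kernel launch boundary itself is the
// global synchronization point, so no spinlock is needed.
__global__ void reducePartialSums(float *dsumvector, int numBlocks)
{
    __shared__ float partial[256];   // assumes blockDim.x <= 256

    // Each thread accumulates a strided slice of the per-block sums.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < numBlocks; i += blockDim.x)
        sum += dsumvector[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction within the single block
    // (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // Write the total to the same slot the spinlock version used.
    if (threadIdx.x == 0)
        dsumvector[numBlocks] = partial[0];
}
```

Called from the host as reducePartialSums<<<1, 256>>>(dsumvector, gridDim.x * gridDim.y), this trades the spinlock for one extra (cheap) kernel launch and has no limit on the number of blocks in the first kernel.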

@prkipfer:

I still don’t get your point. :( My pseudocode looks like this:

loadImagesToSharedMemory();

do {
  setParameter();
  applyParameter();
  distance = calculateDistance();
  // !!! Here I need the overall distance
} while(distance > threshold);

Because the images stay the same over the whole while loop and I access them more than once, it would be nice if I could keep them in shared memory. But if I call the functions above from the host as separate kernels, the images in shared memory are lost every time one of the kernels exits, right!? How can I prevent this with your approach?
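(To make the question concrete, here is my own illustrative skeleton, not anything prkipfer proposed: since shared memory does not survive a kernel exit, a multi-kernel structure has to re-stage the images from global memory at the top of every launch. The TILE_W/TILE_H sizes and the row-major image layout are assumptions for illustration:)

```cuda
#define TILE_W 16
#define TILE_H 16

// Hypothetical per-step kernel: each launch must re-load its image tile,
// because shared memory contents are undefined across kernel launches.
__global__ void applyParameterKernel(const float *dimage1, int width, int height)
{
    __shared__ float tile1[TILE_H][TILE_W];

    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;

    // Re-staging cost paid on every kernel launch of the do/while body.
    if (x < width && y < height)
        tile1[threadIdx.y][threadIdx.x] = dimage1[y * width + x];
    __syncthreads();

    // ... work on tile1 ...
}
```

So the trade-off is between paying this reload on every iteration versus keeping the whole do/while inside one kernel and needing some form of global synchronization for the overall distance.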