Euclidean distance and synchronization

Ok guys, back from hell.

@MisterAnderson42:

I implemented the spinlock global sync on the card. As you already mentioned, this only works up to a certain number of blocks per multiprocessor, so I determined that number for my card. The queue length (blocks per multiprocessor) on the Quadro FX 4600 finally tested out to be 9, after three reboots because my system hung during the tests. Would be nice if this could be fixed in the next driver version. :) So I can now split up the computation so that at most 9x12 blocks are used (the FX 4600 has 12 multiprocessors).

Here is the unoptimized hacked code snippet for the spinlocking:

__global__ void computeDistance(float *dimage1,
                                float *dimage2,
                                float *dsumvector,
                                volatile int *dstate)  // volatile so the spin-wait re-reads memory
{
  [...]

  // Make sure every thread of this block has finished its partial sum.
  __syncthreads();

  // Let one thread per block increment the cycle counter so we can
  // synchronize by spinlocking on this vector: all blocks have finished
  // when the counter is in the same state for all blocks.
  if(threadIdx.x == 0 && threadIdx.y == 0) {
    dstate[blockIdx.x + __mul24(blockIdx.y, gridDim.x)] += 1;

    // Do the spinlock wait with the first block and wait until all blocks
    // have the same counter state.
    if(blockIdx.x == 0 && blockIdx.y == 0) {
      float sum = dsumvector[0];

      // Go through the vector and wait until the two successive counters
      // are equal, then sum up the synchronized element.
      for(int i = 1; i < gridDim.x * gridDim.y; ++i) {
        while(dstate[i-1] != dstate[i]) {}
        sum += dsumvector[i];
      }
      dsumvector[__mul24(gridDim.x, gridDim.y)] = sum;
    }
  }
}
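(Side note, my own sketch rather than anything from this thread: the usual deadlock-free alternative is to end the kernel at the synchronization point and let a second kernel launch act as the global barrier, which works for any number of blocks. A minimal sketch, assuming each block of computeDistance has already written one partial sum into dsumvector[0..numBlocks-1] and that it is launched with a power-of-two block size of at most 256 threads:)

```cuda
// Hypothetical follow-up kernel, launched with a single block after
// computeDistance has finished; the kernel launch boundary itself is the
// global synchronization point, so no spinlock is needed.
__global__ void reducePartialSums(float *dsumvector, int numBlocks)
{
    __shared__ float partial[256];   // assumes blockDim.x <= 256

    // Each thread accumulates a strided slice of the per-block sums.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < numBlocks; i += blockDim.x)
        sum += dsumvector[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction within the single block
    // (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // Write the total to the same slot the spinlock version used.
    if (threadIdx.x == 0)
        dsumvector[numBlocks] = partial[0];
}
```

Called from the host as reducePartialSums<<<1, 256>>>(dsumvector, gridDim.x * gridDim.y), this trades the spinlock for one extra (cheap) kernel launch and has no limit on the number of blocks in the first kernel.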

@prkipfer:

I still don’t get your point. :( My pseudocode looks like this:

loadImagesToSharedMemory();

do {
  setParameter();
  applyParameter();
  distance = calculateDistance();
  // !!! Here I need the overall distance
} while(distance > threshold);

Because the images stay the same over the whole while loop and I access them more than once, it would be nice if I could keep them in shared memory. But if I call the functions above from the host as separate kernels, the images in shared memory are lost every time one of the kernels exits, right!? How can I prevent this with your approach?
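(To make the question concrete, here is my own illustrative skeleton, not anything prkipfer proposed: since shared memory does not survive a kernel exit, a multi-kernel structure has to re-stage the images from global memory at the top of every launch. The TILE_W/TILE_H sizes and the row-major image layout are assumptions for illustration:)

```cuda
#define TILE_W 16
#define TILE_H 16

// Hypothetical per-step kernel: each launch must re-load its image tile,
// because shared memory contents are undefined across kernel launches.
__global__ void applyParameterKernel(const float *dimage1, int width, int height)
{
    __shared__ float tile1[TILE_H][TILE_W];

    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;

    // Re-staging cost paid on every kernel launch of the do/while body.
    if (x < width && y < height)
        tile1[threadIdx.y][threadIdx.x] = dimage1[y * width + x];
    __syncthreads();

    // ... work on tile1 ...
}
```

So the trade-off is between paying this reload on every iteration versus keeping the whole do/while inside one kernel and needing some form of global synchronization for the overall distance.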