Euclidean distance and synchronization

Hello everybody,

I am trying to calculate the Euclidean distance between two images. For that purpose I partition the images so that the distance of each part is handled by one block of threads. In the end I get a vector with the partial distance of each block.

Now my problem is: how can I ensure that all blocks have finished before calculating the final overall distance?

The still unoptimized code:

__global__ void computeDistance(float *dimage1, float *dimage2, float *dsumvector)
{
  // Variable declaration
  extern __shared__ float  shared[];
  float                   *distancevec;
  float                   *sumvec;
  float                    difference;
  float                    sum;
  int                      yoffset;
  int                      position;
  int                      linelength;
  int                      i;

  // Set up the partitioning of the shared memory
  distancevec = shared;
  sumvec      = shared + __mul24(blockDim.x, blockDim.y);

  // Calculate the width of the images from the given grid and block size
  linelength = __mul24(gridDim.x, blockDim.x);

  // Get the distance between the two pixels and save it to shared mem
  position   = threadIdx.x + __mul24(blockIdx.x, blockDim.x);
  position  += __mul24(threadIdx.y + __mul24(blockIdx.y, blockDim.y), linelength);
  yoffset    = __mul24(threadIdx.y, blockDim.x);
  difference = dimage2[position] - dimage1[position];
  distancevec[threadIdx.x + yoffset] = difference * difference;

  // Wait until all threads have calculated their distance
  __syncthreads();

  // Just use the first thread column to sum up row-wise
  if(threadIdx.x == 0) {
    sum = 0.0f;
    for(i = 0; i < blockDim.x; ++i) {
      sum += distancevec[i + yoffset];
    }
    sumvec[threadIdx.y] = sum;
  }

  // Wait until all threads have finished summing
  __syncthreads();

  // Sum up column-wise and write the result to global memory
  if(threadIdx.x == 0 && threadIdx.y == 0) {
    sum = 0.0f;
    for(i = 0; i < blockDim.y; ++i) {
      sum += sumvec[i];
    }
    dsumvector[blockIdx.x + __mul24(blockIdx.y, gridDim.x)] = sum;
  }

  // Now what should I put in here to calculate the overall distance???
}


Thanks in advance!

You cannot ensure that in the kernel. You need to finish the current grid. Then, you can copy the vector to the CPU or do a reduction to get the min/max distance over all blocks. See the prefix sum example in the SDK.
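Such a reduction boils down to a two-stage sum: the distance kernel already writes one partial sum per block, and a second pass (another kernel, or the SDK reduction sample) collapses that vector. A minimal plain-C sketch of just the arithmetic, with illustrative names:

```c
#include <stddef.h>

/* Stage 1 mirrors what the distance kernel already does: each "block"
 * owns a contiguous slice of pixels and writes one partial sum of
 * squared differences. On the GPU, stage 2 would be a second kernel;
 * here both stages are plain C. All names are illustrative. */
void block_distances(const float *img1, const float *img2,
                     size_t npixels, size_t nblocks, float *block_sums)
{
    size_t per_block = npixels / nblocks; /* assume it divides evenly */
    for (size_t b = 0; b < nblocks; ++b) {
        float sum = 0.0f;
        for (size_t i = b * per_block; i < (b + 1) * per_block; ++i) {
            float d = img2[i] - img1[i];
            sum += d * d;
        }
        block_sums[b] = sum;
    }
}

/* Stage 2: reduce the per-block vector to the overall distance. */
float reduce_partial_sums(const float *block_sums, size_t nblocks)
{
    float total = 0.0f;
    for (size_t i = 0; i < nblocks; ++i)
        total += block_sums[i];
    return total;
}
```

The key point is that stage 2 only starts once stage 1's grid has finished, which is exactly the guarantee the kernel boundary gives you.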


:'( I feared that.

For explanation:
The distance calculation is only one part of my algorithm, which does a parameter optimization on the two images. In the process, one image is fixed and can be loaded into shared memory (which should be faster than accessing global memory). Now the issue is that if I leave the kernel function to do the summing on the CPU, the shared memory content is lost, according to the documentation.

So I guess the only solution is to implement a synchronization mechanism myself?! Is there a plan to include such mechanisms in future CUDA releases?



Launching several kernels in sequence is fast, as kernel execution is non-blocking. The first memory transfer will then block, however. So be sure to upload everything before starting the kernel call sequence.

Depending on whether you access the static image in an ordered fashion or at random, you can put it into a texture bound to a CUDA array or put it into constant memory. Both are cached with the texture cache being 2D. Both have more latency than shared memory but are persistent and give better performance than reading linear global memory. Both can be updated from the host.


The only global memory synchronization that can be done is through the atomic integer operations available on compute 1.1 cards (8500, 8600). See the programming guide for details. Using these, it would be possible to implement a barrier.

Since the order of blocks running on the device is undefined, it is impossible to “do the synchronization yourself.” Letting the kernel finish is the only way to be certain that all of the results are written to global memory.

How many computations vs memory reads does your parameter optimization take? If the ratio is high, you may find that the added cost of reading in the image again for the second step is minimal compared to the time spent doing computations.

And you don’t need to do the summing on the CPU. You can have the kernel for the second step read in the partial sums from the first step and complete the sum. Part of my program does this, and since the number of partial sums is small (32 to 512), it isn’t a big performance hit just to have every block of the second kernel finish the sum. If the number of partial sums is high, you may need an intermediate kernel that reduces the sum further before running the second step of your algorithm.

I cannot launch multiple kernels, because the following step of the optimization depends on the result of the distance calculation… :(

The optimization itself is not really computationally expensive, approximately as costly as the distance calculation (8 adds, 5 muls plus the distance calculation) versus reading the two images from memory. The images stay the same, so I tried constant memory, but that didn’t buy me much (actually nothing: I guess this is because the images are larger than the cache, which in the end means I read from global memory).

Regarding the synchronization: Unfortunately I have a compute capability 1.0 card :(, but I thought I could make each block write a “ready” integer to global memory and assign one block to busy-wait until all blocks have finished. OK, there is a chance of some additional delay because I read “old/outdated” values from memory, but at least the synchronization should be possible. Once all blocks have finished, the “special master block” sums up the distances and resets the integer vector, which causes the other blocks (busy-waiting for their status integer to become zero) to start over again. Would this work?

Hmmm… Your spin-wait idea might work, as long as ALL of your blocks fit on the multiprocessors.

If you have more blocks than can be run concurrently on the multiprocessors, this will not work. Here is why. Let’s say you can run 4 blocks per multiprocessor (up to 32 blocks total on a GTX), but have 33 to run. The device will wait until one of the blocks finishes before putting the 33rd onto a multiprocessor to run. However, you have a deadlock, because the first 32 are all waiting on the results of the 33rd before they can move on, and so will never exit.
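This deadlock argument reduces to a one-line counting predicate. A toy model (purely illustrative) that assumes a resident block can only finish after every block has at least started, which is exactly what the spin-wait barrier enforces:

```c
/* Toy model of the scheduler: 'capacity' blocks can be resident at
 * once, and a resident block only finishes after ALL blocks have
 * started (the spin-wait barrier). Returns 1 if the grid deadlocks. */
int barrier_deadlocks(int total_blocks, int capacity)
{
    /* Blocks that get scheduled before any slot must be freed. */
    int started = (total_blocks < capacity) ? total_blocks : capacity;
    /* If some block never started, no resident block can pass the
     * barrier, so no slot is ever freed: deadlock. */
    return started < total_blocks;
}
```

With 33 blocks and room for 32, the predicate fires; with 32 or fewer, everything fits and the barrier can complete.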

edit: I guess this means my earlier suggestion that the atomic integer operations could be used to implement a barrier was incorrect, as it would suffer from the same problem as above.

That’s ok. Just launch the kernels in the correct sequence. The runtime will finish one grid completely before starting the next one (with the same or a different kernel). So if you do the reduction operation I suggested above, you won’t have to transfer anything between the distance calculation and the optimization. All data stays on the GPU. So the GPU will start immediately with the distance calc while you add kernel calls for the reduction and optimization. Then the first readback of the final result will stall the CPU, which is exactly what you want. If you can, use this time for doing other CPU work before calling the transfer.
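The host-side control flow described above might look roughly like this in plain C, with stub functions standing in for the kernel launches (all names are illustrative); on the real device only the final scalar distance would ever be read back:

```c
/* Stubs standing in for kernel launches. On the real device the
 * parameters and partial sums stay in GPU memory; only the reduced
 * scalar distance crosses back to the host. Names are illustrative. */
static float g_param = 8.0f;

static void  update_parameters(void) { g_param *= 0.5f; } /* "optimization kernel" */
static float compute_distance(void)  { return g_param; }  /* "distance + reduction kernels" */

/* Queue distance -> reduction -> optimization each pass; only the
 * readback of the reduced distance stalls the CPU. */
float iterate_until_converged(float threshold, int max_iters)
{
    float distance = compute_distance();
    for (int it = 0; it < max_iters && distance > threshold; ++it) {
        update_parameters();
        distance = compute_distance();
    }
    return distance;
}
```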


Ok guys, back from hell.


I implemented the spinlock global sync on the card… :devil: As you already mentioned, this only works up to a certain number of blocks per multiprocessor, so I determined that number for my card. The queue length (blocks per multiprocessor) on the Quadro FX 4600 finally tested out at 9, after rebooting my system three times because it hung. :thumbsdown: Would be nice if this could be fixed in the next driver version. :) So finally I can split up the computation so that at most 9x12 blocks are used (12 multiprocessors on the FX 4600).

Here is the unoptimized hacked code snippet for the spinlocking:

__global__ void computeDistance(float *dimage1,
                                float *dimage2,
                                float *dsumvector,
                                int   *dstate)
{
  float sum;
  int   i;

  // ... distance calculation as in the first version ...

  // Increment the cycle counter so we can synchronize by spinlocking on
  // this vector, assuming all blocks have finished once the counter is in
  // the same state for all blocks.
  dstate[blockIdx.x + __mul24(blockIdx.y, gridDim.x)] =
    dstate[blockIdx.x + __mul24(blockIdx.y, gridDim.x)] + 1;

  // Do the spinlock wait with the first block, first thread, and wait until
  // all blocks have the same counter state.
  if(threadIdx.x == 0 && threadIdx.y == 0) {
    if(blockIdx.x == 0 && blockIdx.y == 0) {
      sum = dsumvector[0];
      // Go through the vector and wait until the two successive counters
      // are equal, then sum up the synchronized element
      for(i = 1; i < gridDim.x*gridDim.y; ++i) {
        while(dstate[i-1] != dstate[i]) {}
        sum += dsumvector[i];
      }
      dsumvector[__mul24(gridDim.x, gridDim.y)] = sum;
    }
  }
}


I still don’t get your point. :( My pseudocode looks like this:


  do {
    distance = calculateDistance();
    // !!! Here I need the overall distance
  } while(distance > threshold);

Because the images stay the same over the whole while loop and I access them more than once, it would be nice if I could place them in shared memory. Now if I call the functions above from the host (as separate kernels), I lose the images in shared memory each time I leave one of the kernels, right!? How can I prevent this with your approach?

No, you cannot keep them in shared mem, because you would need to lock the kernel to the multiprocessor, which is not possible. But that was not your question, right?

So just write the shared mem to global mem and you can launch the next kernel. You will want to optimize the block layout according to the kernel anyway. Flushing the shared mem to global mem can always be done fully coalesced, as can loading it back for the next kernel.

In case you are aware of how to do that already, forget my posts, they are redundant :)


Oh, it was part of my question. :) My problem is that I copy two images into global memory and then do about 100 iterations on these images (without changing them), where each iteration depends on the distance between the parameterized images. So I access each pixel in both images about 100 times, which to my understanding justifies the effort of copying them into shared memory, right? My problem now is that I cannot calculate the distance between the images without leaving the kernel, which means I cannot place the images in shared mem, but at best in constant mem (which doesn’t improve the speed [tested]). So the consequence is that my algorithm is constrained by the speed of reads from global mem, which is bad, I think!?

Sorry for being so fuzzy in my problem description; I hope this clarifies it a little bit!? Thanks for your help anyway! :thumbup:

Ah, Ok then I didn’t get your question completely. Sorry about that.

You are right: if you cannot run the iteration on a tiled image but need a global distance, there is no way to iterate in shared mem.


:) No problem, thanks for the help anyway. I think I’ll go with the spinlock for now and test whether shared memory buys me as much as I expect for my algorithm. But it would be really nice if I could overcome the 9x12 block limit without buying a new card. :)

Any ideas on that? I think this is a deeper issue in the driver’s thread management and maybe needs extra instructions to tell the driver when a block can be scheduled.


It’s not really a deep issue. It’s actually the same thing you’ve been struggling with :) Only N blocks can run concurrently on one multiprocessor (where the shared memory lives). N is determined by register and shared mem usage; see the guide. Since the shared memory exists for the lifetime of the block, the only way the hardware can make room for another block is to let one finish and then start another.

If the hardware were to support running more than these N blocks on a multiprocessor, it would have to page shared memory out to global memory, and you would lose what you are trying to gain with the spinlock :) Of course, the hardware doesn’t support this, so it is a moot point. And I doubt it is just the driver’s management. I would guess that the requirement that blocks run to completion is built deep into the silicon that schedules threads.

You are perfectly right, and because of this issue I’m not able to handle the partitioning of the data with acceptable effort (together with the limit of 512 threads per block), so I give up and will calculate the sum on the host processor. Maybe this is also triggered by this f… bug that forces me to reboot my machine every time I go over the limit of N blocks. :wacko: Nobody should do this to a Linux machine.

Thanks for your help and for sharing your thoughts!

Don’t do that. Data transfer costs are prohibitively high. Use the prefix sum from the SDK. This will use global mem but the data stays at least on the GPU.


Don’t panic! :) I calculate the sum of each block on the GPU and only sum up the resulting vector on the CPU. I’m just giving up on doing the latter on the GPU.

Edit: Wait, I see your point! You mean I should invoke another kernel with a different configuration to do the summing on the GPU, right? Yeah, that’s a pretty good idea! I’ll try it!

This is as bad as summing the entire thing on the CPU, because the constant overhead of the transfer is enormous. Bandwidth only becomes the issue when transfer sizes differ by several orders of magnitude.


Race condition on the posts :)

Yes. You can immediately launch the summing kernel, for example with 1 block and n×n threads (= the grid of the previous kernel). While this is not the most efficient approach, it is much faster than transferring the data. Make sure you don’t call any other CUDA function in between the kernels; then it will run efficiently and asynchronously.