Serializing and after parallelism again


I am producing some blockresults (global mem) in my kernel and after checking all other blocks have run (with an atomic counter) i sum up the blockresults with this last block.
After this I want to use again all blocks. For this reason I implemented a flag in global memory which is set after the last block have finished summing up the values.
The other blocks are waiting checking the flag. But as I see this leads to a deadlock… (cause all the blocks couldn’t be sheduled?! ) Even if I only check the flag with only one other threadblock it leads to a deadlock.
But I think it should work when i use only as much threadblocks as multiprocessors are avaliable, right?
The advantage of it would be to not load again the needed values out of slow global memory.
Any thougts about it?

You are at least the second person to ask that kind of thing, so try using search, too…

Anyway: Show your code. Since you do not mention anything, I guess you did it wrong (i.e. in a way that would not even work on the CPU).

Note that unless you take special precautions,

while (a) /* wait */;


if (a)

  while (true) /* endless loop */;

are the same and the compiler will generate the later since it is smaller and “faster”.


void f1_to_fx_gpu (float *P, int Ng, float *odata, float mean, unsigned int *counter)


  const int tid = threadIdx.x;

 extern __shared__ float shmem[];

  float* sum1 = (float *)shmem;


  for (int bid = blockIdx.x; bid<Ng; bid += gridDim.x)


   sum1[tid] = 0;


   float val = P[tid + Ng*bid];


   if (val != 0){  sum1[tid] = val * val;  }


   parallel_reduction(sum1, odata, bid);



   if (tid == Ng-1) {  atomicAdd(counter,1);  }


   if (bid == Ng-1){

    while (*counter != Ng){}

    parallel_reduction(odata, odata, 0);

    *counter = 0;

    counter[1] = 1;



   while (counter[1] != 1){}

   // here I want again full parallelism of all Blocks



I tried also with 1,2,3 blocks, then it works, but with 4 or more blocks it didn’t work.

I have a 8600 GT with 4 multiprocessors, so I can’t imagine why it didn’t work at least with 4 blocks…

Can you tell me more about this special precautions?

Thanks in advance!


PS: counter[0] and counter[1] were set to 0 before from host


void f1_to_fx_gpu (float *P, int Ng, float *odata, float mean, unsigned int *counter)


    while (*counter != Ng){}

“counter” is not volatile => “while (*counter != Ng){}” is compiled into “if (*counter != Ng) while (1) {}” => can not work.

Use “volatile unsigned int *counter”.

The problem you’re trying to solve is a KERNEL wide syncthreads, where ALL blocks synchronize. You can’t do that in a single kernel.

An easy counterexample is if you have more blocks than will fit onto your multiprocessors, so some run sequentially. If no block exits, the new blocks can never get scheduled, and you deadlock. That’s what’s happening to you.

__syncthreads() will synchronize threads WITHIN a block. Not what you need.

The usual answer is to use block-level synchronization, meaning a kernel call. Split your work into kernel1 and kernel2. Call them sequentially.
You’re guaranteed a “syncblocks” effect between the two kernel invocations.

This is crude because you lose any guarantee of shared memory state, register state, etc. And there’s kernel overhead. But it’s the right way.

Dangerous way, playing with fire, unsupported, evil, hack: Query the device, learn how many multiprocessors there are. Call your kernel with at most that number of blocks… (more MIGHT work if the blocks double or triple up on the MP, depending on register and shared RAM use.) Then your atomic memory busy-loop method would work since all the blocks are running and there’s none that are waiting for other blocks to finish.

Check out the histogram code in the CUDA SDK, I think that has an answer along the lines of what you’re looking for.

Thanks for all your sugestions!
I have also found a good thread about it:
I will try if this works for me and give me a better performance then calling a second an a third kernel, which have to load the same data again from global memory.

Just to make this clear: if you look at the generated ptx you will see that it still can not work without the volatile. I did not mean to make any promises on if and how it will work with the volatile.

Using too much Registers??

First I got the variable mean from the last kernel from global memory. Then I started the second kernel and passed the real value of mean directly to the kernel.

But now I want to save time and not load the variable from global mem to the host. So I passed only a pointer to the location in global memory to the second kernel. -> float mean = *dmean;

But now the kernel produces false results and finishes a lot faster then normal.

When I comment out one of the reduction statements in the kernel it works again (beside the outcommented fragment).

So I guess there are not enough Registers? Can I verify this in a way? Or are there any other suggestions what problem I have?


void f1_to_fx_gpu (float *P, int Ng, float *odata, float *dmean, volatile unsigned int *mutex)


  extern __shared__ float shmem[];

  float* sum1  = (float *)shmem;

  float* sum2  = (float *)&sum1[Ng];

  float* var  = (float *)&sum2[Ng];

  float* idm  = (float *)&var[Ng];

  float* prom  = (float *)&idm[Ng];

  float* shade  = (float *)&prom[Ng];

  float* maxim  = (float *)&shade[Ng];

  float* entropy	= (float *)&maxim[Ng];

  float* corr  = (float *)&entropy[Ng];

 const int tid = threadIdx.x;

  float mean = *dmean;

 for (int bid = blockIdx.x; bid<Ng; bid += gridDim.x)


   sum1[tid] = 0;

   sum2[tid] = 0;

   var[tid] = 0;

   idm[tid] = 0;

   prom[tid] = 0;

   shade[tid] = 0;

   maxim[tid] = 0;

   entropy[tid] = 0;

   corr[tid] = 0;

  float val = P[tid + Ng*bid];

   int x = (bid-mean)+(tid-mean);


   if (val != 0){

    sum1[tid] = val * val;

    sum2[tid] = val * (tid-bid)*(tid-bid);

    var[tid] = (tid - mean) * (tid - mean) * val;

    idm[tid] = val / (1 + (tid - bid) * (tid - bid));

    prom[tid] = val*x*x*x*x;

    shade[tid] = val*x*x*x;

    maxim[tid] = val;

    entropy[tid] = val * log(val);

    corr[tid] = (tid - mean) * (bid - mean) * val;



  parallel_reduction(sum1, odata, bid);

   parallel_reduction(sum2, odata+1*Ng, bid);

   parallel_reduction(var, odata+2*Ng, bid);

   parallel_reduction(idm, odata+3*Ng, bid);

   parallel_reduction(prom, odata+4*Ng, bid);

   parallel_reduction(shade, odata+5*Ng, bid);

   parallel_reduction_max(maxim, odata+6*Ng, bid);

   parallel_reduction(entropy, odata+7*Ng, bid);

   //parallel_reduction(corr, odata+8*Ng, bid);

   //sehr wahrscheinlich zuwenig register vorhanden





 if (blockIdx.x == 0){

  parallel_reduction(odata, odata, 0);

  parallel_reduction(odata+1*Ng, odata,1);

     parallel_reduction(odata+2*Ng, odata,2);

  parallel_reduction(odata+3*Ng, odata,3);

     parallel_reduction(odata+4*Ng, odata,4);

  parallel_reduction(odata+5*Ng, odata,5);

     parallel_reduction_max(odata+6*Ng, odata,6);

  parallel_reduction(odata+7*Ng, odata,7);

  parallel_reduction(odata+8*Ng, odata,8);


__device__ void __syncblocks(volatile unsigned int *mutex)



if (threadIdx.x == 0) // only let the first  block thread in from here/


   atomicAdd((unsigned int*)mutex,1);

   if (blockIdx.x == 0) // only let the master thread through this point.


  while (*mutex < gridDim.x){ } // this will be true when all blocks are waiting.

     *mutex = 0; // let all blocks through the barrier.


   // keep sitting idle until we're allowed through.

   while (*mutex > 0) {  }