Serializing and after parallelism again

burnie · July 10, 2008, 12:58pm

Hi,

I am producing some blockresults (global mem) in my kernel and after checking all other blocks have run (with an atomic counter) i sum up the blockresults with this last block.
After this I want to use again all blocks. For this reason I implemented a flag in global memory which is set after the last block have finished summing up the values.
The other blocks are waiting checking the flag. But as I see this leads to a deadlock… (cause all the blocks couldn’t be sheduled?! ) Even if I only check the flag with only one other threadblock it leads to a deadlock.
But I think it should work when i use only as much threadblocks as multiprocessors are avaliable, right?
The advantage of it would be to not load again the needed values out of slow global memory.
Any thougts about it?

Reimar · July 10, 2008, 2:04pm

You are at least the second person to ask that kind of thing, so try using search, too…

Anyway: Show your code. Since you do not mention anything, I guess you did it wrong (i.e. in a way that would not even work on the CPU).

Note that unless you take special precautions,

while (a) /* wait */;

and

if (a)

  while (true) /* endless loop */;

are the same and the compiler will generate the later since it is smaller and “faster”.

burnie · July 10, 2008, 2:58pm

__global__

void f1_to_fx_gpu (float *P, int Ng, float *odata, float mean, unsigned int *counter)

{

  const int tid = threadIdx.x;

 extern __shared__ float shmem[];

  float* sum1 = (float *)shmem;

   

  for (int bid = blockIdx.x; bid<Ng; bid += gridDim.x)

  {

   sum1[tid] = 0;

   

   float val = P[tid + Ng*bid];

     

   if (val != 0){  sum1[tid] = val * val;  }

  __syncthreads();

   parallel_reduction(sum1, odata, bid);

      

   __syncthreads();

   if (tid == Ng-1) {  atomicAdd(counter,1);  }

  

   if (bid == Ng-1){

    while (*counter != Ng){}

    parallel_reduction(odata, odata, 0);

    *counter = 0;

    counter[1] = 1;

   }

  __syncthreads();

   while (counter[1] != 1){}

   // here I want again full parallelism of all Blocks

  }

}

I tried also with 1,2,3 blocks, then it works, but with 4 or more blocks it didn’t work.

I have a 8600 GT with 4 multiprocessors, so I can’t imagine why it didn’t work at least with 4 blocks…

Can you tell me more about this special precautions?

Thanks in advance!

burnie

PS: counter[0] and counter[1] were set to 0 before from host

Reimar · July 10, 2008, 6:39pm

__global__

void f1_to_fx_gpu (float *P, int Ng, float *odata, float mean, unsigned int *counter)

[....]

    while (*counter != Ng){}

“counter” is not volatile => “while (*counter != Ng){}” is compiled into “if (*counter != Ng) while (1) {}” => can not work.

Use “volatile unsigned int *counter”.

SPWorley · July 10, 2008, 8:46pm

The problem you’re trying to solve is a KERNEL wide syncthreads, where ALL blocks synchronize. You can’t do that in a single kernel.

An easy counterexample is if you have more blocks than will fit onto your multiprocessors, so some run sequentially. If no block exits, the new blocks can never get scheduled, and you deadlock. That’s what’s happening to you.

__syncthreads() will synchronize threads WITHIN a block. Not what you need.

The usual answer is to use block-level synchronization, meaning a kernel call. Split your work into kernel1 and kernel2. Call them sequentially.
You’re guaranteed a “syncblocks” effect between the two kernel invocations.

This is crude because you lose any guarantee of shared memory state, register state, etc. And there’s kernel overhead. But it’s the right way.

Dangerous way, playing with fire, unsupported, evil, hack: Query the device, learn how many multiprocessors there are. Call your kernel with at most that number of blocks… (more MIGHT work if the blocks double or triple up on the MP, depending on register and shared RAM use.) Then your atomic memory busy-loop method would work since all the blocks are running and there’s none that are waiting for other blocks to finish.

pstach · July 10, 2008, 9:09pm

Check out the histogram code in the CUDA SDK, I think that has an answer along the lines of what you’re looking for.

burnie · July 10, 2008, 9:58pm

Thanks for all your sugestions!
I have also found a good thread about it: [url=“The Official NVIDIA Forums | NVIDIA”]The Official NVIDIA Forums | NVIDIA
I will try if this works for me and give me a better performance then calling a second an a third kernel, which have to load the same data again from global memory.

Reimar · July 11, 2008, 6:31am

Just to make this clear: if you look at the generated ptx you will see that it still can not work without the volatile. I did not mean to make any promises on if and how it will work with the volatile.

burnie · July 12, 2008, 4:49pm

Using too much Registers??

First I got the variable mean from the last kernel from global memory. Then I started the second kernel and passed the real value of mean directly to the kernel.

But now I want to save time and not load the variable from global mem to the host. So I passed only a pointer to the location in global memory to the second kernel. → float mean = *dmean;

But now the kernel produces false results and finishes a lot faster then normal.

When I comment out one of the reduction statements in the kernel it works again (beside the outcommented fragment).

So I guess there are not enough Registers? Can I verify this in a way? Or are there any other suggestions what problem I have?

__global__

void f1_to_fx_gpu (float *P, int Ng, float *odata, float *dmean, volatile unsigned int *mutex)

{

  extern __shared__ float shmem[];

  float* sum1  = (float *)shmem;

  float* sum2  = (float *)&sum1[Ng];

  float* var  = (float *)&sum2[Ng];

  float* idm  = (float *)&var[Ng];

  float* prom  = (float *)&idm[Ng];

  float* shade  = (float *)&prom[Ng];

  float* maxim  = (float *)&shade[Ng];

  float* entropy	= (float *)&maxim[Ng];

  float* corr  = (float *)&entropy[Ng];

 const int tid = threadIdx.x;

  float mean = *dmean;

 for (int bid = blockIdx.x; bid<Ng; bid += gridDim.x)

  {

   sum1[tid] = 0;

   sum2[tid] = 0;

   var[tid] = 0;

   idm[tid] = 0;

   prom[tid] = 0;

   shade[tid] = 0;

   maxim[tid] = 0;

   entropy[tid] = 0;

   corr[tid] = 0;

  float val = P[tid + Ng*bid];

   int x = (bid-mean)+(tid-mean);

   

   if (val != 0){

    sum1[tid] = val * val;

    sum2[tid] = val * (tid-bid)*(tid-bid);

    var[tid] = (tid - mean) * (tid - mean) * val;

    idm[tid] = val / (1 + (tid - bid) * (tid - bid));

    prom[tid] = val*x*x*x*x;

    shade[tid] = val*x*x*x;

    maxim[tid] = val;

    entropy[tid] = val * log(val);

    corr[tid] = (tid - mean) * (bid - mean) * val;

   }

  __syncthreads();

  parallel_reduction(sum1, odata, bid);

   parallel_reduction(sum2, odata+1*Ng, bid);

   parallel_reduction(var, odata+2*Ng, bid);

   parallel_reduction(idm, odata+3*Ng, bid);

   parallel_reduction(prom, odata+4*Ng, bid);

   parallel_reduction(shade, odata+5*Ng, bid);

   parallel_reduction_max(maxim, odata+6*Ng, bid);

   parallel_reduction(entropy, odata+7*Ng, bid);

   //parallel_reduction(corr, odata+8*Ng, bid);

   //sehr wahrscheinlich zuwenig register vorhanden

  }

 #ifndef __DEVICE_EMULATION__

     __syncblocks(mutex);

  #endif

 if (blockIdx.x == 0){

  parallel_reduction(odata, odata, 0);

  parallel_reduction(odata+1*Ng, odata,1);

     parallel_reduction(odata+2*Ng, odata,2);

  parallel_reduction(odata+3*Ng, odata,3);

     parallel_reduction(odata+4*Ng, odata,4);

  parallel_reduction(odata+5*Ng, odata,5);

     parallel_reduction_max(odata+6*Ng, odata,6);

  parallel_reduction(odata+7*Ng, odata,7);

  parallel_reduction(odata+8*Ng, odata,8);

  }

}

__device__ void __syncblocks(volatile unsigned int *mutex)

{

 __syncthreads(); 

if (threadIdx.x == 0) // only let the first  block thread in from here/

 {

   atomicAdd((unsigned int*)mutex,1);

   if (blockIdx.x == 0) // only let the master thread through this point.

   {

  while (*mutex < gridDim.x){ } // this will be true when all blocks are waiting.

     *mutex = 0; // let all blocks through the barrier.

   }

   // keep sitting idle until we're allowed through.

   while (*mutex > 0) {  } 

 }

__syncthreads();

}

Topic		Replies	Views
__syncblocks 101 Primitives for Interblock syncronization CUDA Programming and Performance	16	10033	February 29, 2008
Synchronize all blocks in CUDA CUDA Programming and Performance	12	45950	October 25, 2013
Best way to pack bits into words for global memory Better than reduce in shared memory? CUDA Programming and Performance	17	6706	June 2, 2012
Global thread barrier CUDA Programming and Performance	78	85799	December 23, 2011
sum up blockresults with last thread-block? CUDA Programming and Performance	2	2365	July 7, 2008
Global Sync CUDA Programming and Performance	7	5999	October 4, 2007
Comp Capability 1.0 Multiprocessor Synchronization Can it be done? CUDA Programming and Performance	5	3941	July 23, 2007
Implementing global lock (critical section) on 1-block cuda process CUDA Programming and Performance cuda , kernel	1	178	June 4, 2024
kernels communication and concurrency CUDA Programming and Performance	11	1554	June 8, 2015
Question regarding CUDA streams CUDA Programming and Performance	4	2478	May 21, 2009

Serializing and after parallelism again

Related topics