Is a sync always required when a second kernel's grid parameters are dependent on the output of the first kernel?

SamMartin · June 4, 2017, 7:21pm

I have this attached pattern in my code a lot. Basically it amounts to a first kernel for filtering a large dataset, where the selected entries returned will be very sparse, and then a second kernel for performing a much more involved computation on the much-reduced dataset.

It seems like the cudaStreamSynchronize is almost superfluous, but I can’t see any way around it. Is there an alternative pattern that helps here? Does CUDA dynamic parallelism help in any way?

/* Pseudocode. Please ignore silly mistakes/syntax and inefficiant/incorrect simplifications */

__global__ void bar( const float * dataIn, float * dataOut, unsigned int * counter_ptr ) 
{
   < do some computation > 
   if (bConditionalComputedAboveIsTrue)
   { 
      const unsigned int ind = atomicInc(counter_ptr, (unsigned int)(-1));
      dataOut[ ind ] = resultOfAboveComputation;
   } 
}

int foo( float * d_datain, float* d_tempbuffer, float* d_output, cudaStream_t stream  ){    
   /* Initialize a counter that will be updated by the bar kernel */ 
   unsigned int * counter_ptr;
   cudaMalloc( &counter_ptr, sizeof( unsigned int) ); //< Create a Counter
   cudaMemsetAsync(counter_ptr, 0, sizeof(unsigned int), stream); //<Initially Set the Counter to 0
   dim3 threadsInit(16,16,1);
   dim3 gridInit(256, 1, 1);
   /* Launch the Filtering Kernel. This will update the value in counter_ptr*/
   bar<<< gridInit, threadsInit, 0, stream >>>( d_datain, d_tempbuffer, counter_ptr );
   /* Download the count and synchronize the stream */ 
   unsigned int count;
   cudaMemcpyAsync(&count, counter_ptr, sizeof(unsigned int), cudaMemcpyDeviceToHost, stream);
   cudaStreamSynchronize( stream ); //< Is there any way around this synchronize? 
   /* Compute the grid parameters and launch a second kernel */
   dim3 bazThreads(128,1,1);
   dim3 bazGrid( count/128 + 1, 1, 1); //< Here I use the counter modified in the prior kernel to set the grid parameters
   baz<<< bazGrid, bazThreads, 0, stream >>>( d_tempbuffer, d_output );
   /* cleanup */
   cudaFree(counter_ptr);
}

tera · June 4, 2017, 7:37pm

Instead of varying the number of blocks in the second kernel, you could use a fixed block count and have the blocks adapt the amount of work they do.

E.g. launch a larger number of blocks, and have them exit early if no work remains. Or launch just enough blocks to fill the device, and have each block loop over the work to to.

tera · June 4, 2017, 7:39pm

Cross-reference: [url=“Is a sync always required when a CUDA kernel's launch parameters are dependent on a prior kernel? - Stack Overflow”]Is a sync always required when a CUDA kernel's launch parameters are dependent on a prior kernel? - Stack Overflow

Jimmy_Pettersson · June 5, 2017, 10:51am

Yes with that pattern you would have to synchronize. This is usually inefficient, so consider tera’s suggenstion above.

This could also be resolved with dynamic parallelism (DP) though I have a feeling it might not be as efficient (I’m not a big user of DP so I could be wrong).

Topic		Replies	Views
About Synchronize CUDA Programming and Performance	4	1490	March 26, 2009
problem about cudaThreadSynchronize() CUDA Programming and Performance	3	7878	November 25, 2007
cudaThreadSynchronize() and multiple kernels when is it necessary to sync? CUDA Programming and Performance	2	8370	June 20, 2008
How to synchronize a Kernel with many for loops CUDA Programming and Performance	12	12101	November 28, 2011
Synchronization methods? CUDA Programming and Performance	11	2217	November 7, 2010
cudaThreadSynchronize usage CUDA Programming and Performance	3	2959	October 21, 2008
Program hangs at cudaThreadsynchronize CUDA Programming and Performance	12	9684	April 7, 2011
Why some synchronize function make cudaMemcpyAsync and kernal in different stream work in sequential CUDA Programming and Performance	2	6582	March 1, 2011
when should cudaThreadSynchronize() be called? CUDA Programming and Performance	5	3321	October 22, 2010
stream synchronize problem CUDA Programming and Performance	2	764	August 28, 2017

Is a sync always required when a second kernel's grid parameters are dependent on the output of the first kernel?

Related topics