Is a sync always required when a second kernel's grid parameters are dependent on the output of the first kernel?

This pattern comes up a lot in my code. Basically it amounts to a first kernel that filters a large dataset, where the selected entries returned will be very sparse, and then a second kernel that performs a much more involved computation on the much-reduced dataset.

It seems like the cudaStreamSynchronize is almost superfluous, but I can’t see any way around it. Is there an alternative pattern that helps here? Does CUDA dynamic parallelism help in any way?

/* Pseudocode. Please ignore silly mistakes/syntax and inefficient/incorrect simplifications */

__global__ void bar( const float * dataIn, float * dataOut, unsigned int * counter_ptr )
{
   < do some computation >
   if (bConditionalComputedAboveIsTrue) {
      const unsigned int ind = atomicInc(counter_ptr, (unsigned int)(-1));
      dataOut[ ind ] = resultOfAboveComputation;
   }
}

int foo( float * d_datain, float * d_tempbuffer, float * d_output, cudaStream_t stream ){
   /* Initialize a counter that will be updated by the bar kernel */
   unsigned int * counter_ptr;
   cudaMalloc( &counter_ptr, sizeof(unsigned int) ); //< Create a counter
   cudaMemsetAsync( counter_ptr, 0, sizeof(unsigned int), stream ); //< Initially set the counter to 0
   dim3 threadsInit(16,16,1);
   dim3 gridInit(256, 1, 1);
   /* Launch the filtering kernel. This will update the value in counter_ptr */
   bar<<< gridInit, threadsInit, 0, stream >>>( d_datain, d_tempbuffer, counter_ptr );
   /* Download the count and synchronize the stream */
   unsigned int count;
   cudaMemcpyAsync( &count, counter_ptr, sizeof(unsigned int), cudaMemcpyDeviceToHost, stream );
   cudaStreamSynchronize( stream ); //< Is there any way around this synchronize?
   /* Compute the grid parameters and launch a second kernel */
   dim3 bazThreads(128,1,1);
   dim3 bazGrid( count/128 + 1, 1, 1 ); //< Here I use the counter modified in the prior kernel to set the grid parameters
   baz<<< bazGrid, bazThreads, 0, stream >>>( d_tempbuffer, d_output );
   /* cleanup */
   cudaFree( counter_ptr );
   return 0;
}

Instead of varying the number of blocks in the second kernel, you could use a fixed block count and have the blocks adapt the amount of work they do.

E.g. launch a larger number of blocks, and have them exit early if no work remains. Or launch just enough blocks to fill the device, and have each block loop over the work to do.
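A minimal sketch of that second variant, assuming the counter is left in device memory so the host never needs its value (kernel and variable names are made up for illustration; the body is a placeholder for your real computation):

```cuda
// Fixed-grid version of the second kernel: each thread grid-strides over
// however many entries bar() actually produced. The count is read on the
// device, so no cudaMemcpyAsync/cudaStreamSynchronize is needed on the host.
__global__ void baz_fixed( const float * dataIn, float * dataOut,
                           const unsigned int * counter_ptr )
{
    const unsigned int count = *counter_ptr; // number of surviving entries
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < count;
         i += gridDim.x * blockDim.x)
    {
        dataOut[i] = dataIn[i] * 2.0f; // placeholder for the involved computation
    }
}

// Host side: pick a fixed grid big enough to fill the device and enqueue it
// right after bar() on the same stream - no synchronization in between.
// baz_fixed<<< 32 * numSMs, 128, 0, stream >>>( d_tempbuffer, d_output, counter_ptr );
```

Threads whose index starts beyond `count` fail the loop condition immediately and exit, so over-provisioning the grid costs almost nothing.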


Yes, with that pattern you would have to synchronize. This is usually inefficient, so consider tera's suggestion above.

This could also be resolved with dynamic parallelism (DP), though I have a feeling it might not be as efficient (I'm not a big user of DP, so I could be wrong).
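For completeness, here is roughly what the DP variant would look like, assuming a device of compute capability 3.5 or higher and compilation with `-rdc=true` (names are illustrative; `baz` stands in for the real second kernel):

```cuda
// Dynamic-parallelism sketch: a single-thread launcher kernel reads the
// counter on the device and sizes the grid there, so the host enqueues
// everything up front and never synchronizes mid-pipeline.
__global__ void baz( const float * dataIn, float * dataOut )
{
    // placeholder for the involved computation
}

__global__ void launch_baz( const float * dataIn, float * dataOut,
                            const unsigned int * counter_ptr )
{
    const unsigned int count = *counter_ptr;
    if (count > 0) {
        dim3 bazGrid( (count + 127) / 128, 1, 1 ); // ceil(count / 128) blocks
        baz<<< bazGrid, 128 >>>( dataIn, dataOut );
    }
}

// Host side: enqueue the launcher after bar() on the same stream;
// the cudaMemcpyAsync and cudaStreamSynchronize both go away.
// launch_baz<<< 1, 1, 0, stream >>>( d_tempbuffer, d_output, counter_ptr );
```

The trade-off is the overhead of a device-side launch and the relocatable-device-code build requirement, which is why it may not beat the fixed-grid approach in practice.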