CUDA streams: can I get more control over kernel scheduling?

Hi. Below is the code in question; for brevity I removed error checking, parameter assignments, and thread control. It runs as a separate thread, and the examples below use two such threads. It is a recursive reduction that computes a dot product: the first pass multiplies the components, and the following passes just add the results together until fully reduced.

void
recursiveKernel2DThread(uint32_t myIndex)
{
  // Bind the shared context to this thread and create one stream per thread.
  CUresult status = cuCtxSetCurrent(m_context->m_context);
  CUstream str;
  status = cuStreamCreate(&str, 0);

  // Kernel parameter array (assignments removed for brevity).
  std::valarray<void*> reduce2DParams(5);
  CUdeviceptr myTempWeightsDev = m_tempWeightsDev + myIndex*m_inputDimension*sizeof(float);

  while(m_running)
  {
    // First pass: multiply the components.
    status = cuLaunchKernel(m_dotComponents, m_inputGridSize_2, 1, 1, m_blockSize, 1, 1,
                            m_inputDimension*sizeof(float), str,
                            &(reduce2DParams[0]), 0);

    const uint32_t doubleBlock = m_blockSize*2;
    uint32_t remaining = m_inputDimension;
    uint32_t gridSize = m_inputGridSize_2;

    // Following passes: sum the partial results until fully reduced.
    while(m_running)
    {
      if (remaining < doubleBlock)
      {
        // Fully reduced: copy the single result into this thread's response slot.
        // std::this_thread::sleep_for(std::chrono::microseconds(10));
        status = cuMemcpyDtoDAsync(m_responseDev + myIndex*sizeof(float), myTempWeightsDev, sizeof(float), str);
        break;
      }

      // Ceiling division by doubleBlock.
      remaining = remaining/doubleBlock + ((remaining%doubleBlock)?1:0);
      gridSize = gridSize/doubleBlock + ((gridSize%doubleBlock)?1:0);

      status = cuLaunchKernel(m_reduceSum, gridSize, 1, 1, m_blockSize, 1, 1,
                              remaining*sizeof(float), str,
                              &(reduce2DParams[0]), 0);
    }
  }
}
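
To make the launch pattern easier to reason about, here is a host-only sketch of the same arithmetic (no CUDA calls). The names ceilDiv, countReduceSumLaunches and dotByChunkedReduction are made up for this illustration and are not part of my real code, and the CPU function only mirrors the reduction as I described it above (multiply, then sum chunks of 2*blockSize), not the actual kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Ceiling division, same arithmetic as `x/d + ((x%d)?1:0)` in the thread loop.
inline uint32_t ceilDiv(uint32_t x, uint32_t d)
{
  assert(d > 0);
  return x / d + ((x % d) ? 1 : 0);
}

// Number of m_reduceSum launches the inner loop issues per outer iteration,
// assuming m_running stays true for the whole iteration.
uint32_t countReduceSumLaunches(uint32_t inputDimension, uint32_t blockSize)
{
  assert(blockSize > 0);
  const uint32_t doubleBlock = blockSize * 2;
  uint32_t remaining = inputDimension;
  uint32_t launches = 0;
  while (remaining >= doubleBlock)
  {
    remaining = ceilDiv(remaining, doubleBlock);
    ++launches;
  }
  return launches;
}

// CPU reference (illustrative only): multiply the components, then repeatedly
// sum chunks of 2*blockSize elements until a single value is left.
float dotByChunkedReduction(const std::vector<float>& a,
                            const std::vector<float>& b,
                            uint32_t blockSize)
{
  assert(a.size() == b.size());
  assert(blockSize > 0);
  const std::size_t doubleBlock = static_cast<std::size_t>(blockSize) * 2;

  // First pass: component-wise products.
  std::vector<float> partial(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    partial[i] = a[i] * b[i];

  // Following passes: one partial sum per chunk.
  while (partial.size() > 1)
  {
    std::vector<float> next;
    for (std::size_t begin = 0; begin < partial.size(); begin += doubleBlock)
    {
      const std::size_t end = std::min(begin + doubleBlock, partial.size());
      next.push_back(std::accumulate(
          partial.begin() + static_cast<std::ptrdiff_t>(begin),
          partial.begin() + static_cast<std::ptrdiff_t>(end), 0.0f));
    }
    partial.swap(next);
  }
  return partial.empty() ? 0.0f : partial[0];
}
```

For example, countReduceSumLaunches(1048576, 256) returns 2 (remaining goes 1048576, 2048, 4), so in that hypothetical configuration each thread would queue one dotComponents launch, two reduceSum launches, and one copy per outer iteration in its stream.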

Here are three profiler timelines from separate runs of the same thread function (recursiveKernel2DThread, two threads):

  1. http://imgur.com/OYk0Dul

  2. http://imgur.com/Q5RCZK6

  3. http://imgur.com/JCPnkZ7

Here is the closest I got to the ideal; it only happened when I uncommented the sleep_for line in the code above:

http://imgur.com/n45DXMt

I was expecting something like the last profile, except that reduceSum in stream 14 would start as soon as dotComponents finishes in that same stream, rather than waiting until dotComponents has finished in both streams. Is that achievable?

In general, is there a way to force the last case over the first three without inserting arbitrary sleeps (whose duration will obviously be system-dependent)? Or at least to eliminate the worst cases, like 2 and 3?