A sleep after kernel launches substantially increases performance in my multithreaded CUDA DLL

I am trying to maximize the performance of a multithreaded DLL that uses CUDA to alleviate an algorithmic bottleneck in a Java application (via JNI).

The Java application launches CPU threads that call into the DLL at irregular intervals. For each Java thread, a series of kernels is launched, and these must be synchronized before the final result is returned.
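For reference, each thread uses its own stream and event stored in a per-thread context (the `ptr->stream` and `ptr->kernelFinish` seen below). A minimal sketch of such a setup, assuming hypothetical names `ThreadContext` and `initThreadContext` (the real code presumably differs), using a non-blocking stream so it cannot implicitly synchronize with the legacy default stream:

```cpp
#include <cuda_runtime.h>

struct ThreadContext {
    cudaStream_t stream;
    cudaEvent_t  kernelFinish;
};

void initThreadContext(ThreadContext* ptr) {
    // cudaStreamNonBlocking: no implicit synchronization with stream 0,
    // which otherwise serializes work launched from different threads
    cudaStreamCreateWithFlags(&ptr->stream, cudaStreamNonBlocking);
    // cudaEventDisableTiming makes the event cheaper to record and query
    cudaEventCreateWithFlags(&ptr->kernelFinish, cudaEventDisableTiming);
}
```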

The code looks something like this:

gpuErrchk(cudaMemcpyAsync(ptr->d_values, smooth_values, 4 * smooth_length * sizeof(float), cudaMemcpyHostToDevice, ptr->stream));
gpuErrchk(cudaMemcpyAsync(ptr->d_valuestmp, ptr->d_values, 4 * smooth_length * sizeof(float), cudaMemcpyDeviceToDevice, ptr->stream));
// Launch the kernel sequence `repeats` times on this thread's stream
for (int i_rep = 0; i_rep < repeats; i_rep++) {
    precalculatePowers<<<gridSizePowers, ptr->blockSizePowers, 0, ptr->stream>>>(ptr->precalcPow, in, smooth_length, alt);
    // Perform task in CUDA kernel (~2 ms)
    SmoothDataPoint<<<gridSizeSmooth, ptr->blockSizeSmooth, shared_space * sizeof(float), ptr->stream>>>(
        out, range, threshold, ptr->precalcPow,
        in, smooth_length, alt, radius, ptr->blockSizeSmooth,
        startScan - preCalcStart, preCalcStop - stopScan);
}
//std::this_thread::sleep_for(std::chrono::milliseconds(2 * repeats));

gpuErrchk(cudaMemcpyAsync(smooth_values, out, 4 * smooth_length * sizeof(float), cudaMemcpyDeviceToHost, ptr->stream));
Without the sleep, the kernels from the different threads are launched in series, which hurts performance. If I add a small sleep before the cudaMemcpyAsync, I can see that the kernels start concurrently and performance improves greatly, but I don't know why.
Something seems to be blocking the other threads from launching their kernels, but I'm not sure what.
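One thing worth checking here: cudaMemcpyAsync is only truly asynchronous with respect to the host thread when the host buffer is page-locked (pinned). With ordinary pageable memory, the copy is staged through an internal buffer and can block the calling thread, which would serialize launches across threads much like the behavior described above. A sketch of pinning the host buffer (assuming `smooth_values` is currently allocated with new/malloc; the actual allocation site is not shown in the question):

```cpp
#include <cuda_runtime.h>

// Page-locked (pinned) host allocation: cudaMemcpyAsync to/from this
// buffer can overlap with host execution and with other streams.
float* allocSmoothValues(size_t smooth_length) {
    float* smooth_values = nullptr;
    cudaMallocHost(&smooth_values, 4 * smooth_length * sizeof(float));
    return smooth_values;
}

void freeSmoothValues(float* smooth_values) {
    cudaFreeHost(smooth_values);  // must be freed with cudaFreeHost, not free/delete
}
```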

Additionally, I have tried different synchronization methods, including cudaEventSynchronize and looped cudaEventQuery calls, but the results are the same.
For example, instead of cudaStreamSynchronize I used:

cudaEventRecord(ptr->kernelFinish, ptr->stream);
while (cudaEventQuery(ptr->kernelFinish) != cudaSuccess) {
    // optionally sleep here to avoid busy-waiting
}
I was hoping this approach would improve performance, but it didn't. Confusingly, if I remove the sleep from this while loop, performance improves marginally (while keeping the sleep before the cudaMemcpyAsync). How exactly does cudaEventQuery work internally? Does the issue only lie with thread ownership?
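For what it's worth, cudaEventQuery is a non-blocking poll: it returns cudaSuccess once all work captured by the event has completed and cudaErrorNotReady otherwise, so the while loop above busy-spins inside the driver. With several threads polling at once, that spinning can contend for the driver's internal locks and delay launches from other threads. One alternative worth trying (a sketch, not tested against the code in the question; `waitForKernels` is a hypothetical helper) is to create the event with cudaEventBlockingSync, so that cudaEventSynchronize puts the calling thread to sleep instead of spinning:

```cpp
#include <cuda_runtime.h>

void waitForKernels(cudaStream_t stream) {
    cudaEvent_t done;
    // cudaEventBlockingSync: cudaEventSynchronize on this event yields the
    // CPU (blocks the thread) rather than spin-waiting in the driver.
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync | cudaEventDisableTiming);
    cudaEventRecord(done, stream);
    cudaEventSynchronize(done);  // returns once all prior work in `stream` is done
    cudaEventDestroy(done);
}
```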