Concurrent memcpy and kernel execution

Hi guys,
I am struggling with streams and memcpys. I have four kernels running in a loop. The second kernel computes a single value that I need to copy to the CPU while the remaining kernels are still running. I am using the Visual Profiler to check the timeline of the kernels.
I have tried concurrent streams with events, as well as async memcpy with pinned memory and a cudaDeviceSynchronize at the end of the loop. Unfortunately, all variants introduce too much delay.
Does anyone have a good suggestion?

Here is the pseudocode:

while(...)
{
  kernel1<<<...>>>(...);
  kernel2<<<...>>>(...);
  cudaMemcpy(&h_val, d_val, sizeof(float), cudaMemcpyDeviceToHost); // blocking copy
  kernel3<<<...>>>(...);
  kernel4<<<...>>>(...);
  // do something with h_val
}

Thank you for your help.

I have managed to reduce the latencies between kernel launches.

cudaMallocHost(&val, sizeof(float)); // pinned host memory
while(...)
{
  kernel1<<<grid, block, 0, stream>>>(...);
  kernel2<<<grid, block, 0, stream>>>(...);
  cudaMemcpyAsync(val, d_val, sizeof(float), cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(event, stream);
  kernel3<<<grid, block, 0, stream>>>(...);
  kernel4<<<grid, block, 0, stream>>>(...);
  cudaEventSynchronize(event); // host waits only for the copy, not for kernel3/kernel4
  // do something with val
  func(*val);
}

This solution produces the best results so far.
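For completeness, a fuller version of that loop might look as follows. This is only a sketch: stream/event setup is spelled out, and `d_val` (the device location kernel2 writes its scalar to) is an assumed name.

```cuda
float *val;     // pinned host buffer
float *d_val;   // device result, written by kernel2
cudaStream_t stream;
cudaEvent_t event;

cudaMallocHost(&val, sizeof(float));  // pinned, so the async copy is truly async
cudaMalloc(&d_val, sizeof(float));
cudaStreamCreate(&stream);
cudaEventCreateWithFlags(&event, cudaEventDisableTiming); // no timing overhead

while (...)
{
    kernel1<<<grid, block, 0, stream>>>(...);
    kernel2<<<grid, block, 0, stream>>>(...);
    cudaMemcpyAsync(val, d_val, sizeof(float),
                    cudaMemcpyDeviceToHost, stream); // queued after kernel2
    cudaEventRecord(event, stream);                  // marks completion of the copy
    kernel3<<<grid, block, 0, stream>>>(...);
    kernel4<<<grid, block, 0, stream>>>(...);
    cudaEventSynchronize(event); // host blocks only until the copy is done,
                                 // while kernel3/kernel4 keep running on the GPU
    func(*val);
}

cudaEventDestroy(event);
cudaStreamDestroy(stream);
cudaFreeHost(val);
cudaFree(d_val);
```

The key point is that the host waits on the event recorded right after the copy, not on the whole stream, so kernel3 and kernel4 overlap with the host-side use of the value.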

I would think that your kernel grid/block dimensions and kernel dependencies (kernels requiring other/previous kernels' output data, or data from the host) would largely determine whether you could still improve on this.

All kernels have the same dimensions, each kernel is dependent on the previous kernel's output, and they do not require any data from the host.

“All kernels have the same dimensions”

It is still difficult to gauge whether the GPU is fully utilized - one of the key reasons to use streams in the first place.

Depending on whether the GPU is already fully utilized, and on dependencies between loop iterations - what you are looping on - you may, for example, also have:

cudaEvent_t trigger1, trigger2;

while (...)
{
  kernel1<<<grid, block, 0, stream1>>>(...);
  kernel2<<<grid, block, 0, stream1>>>(...);
  cudaEventRecord(trigger1, stream1);
  kernel3<<<grid, block, 0, stream1>>>(...);
  kernel4<<<grid, block, 0, stream1>>>(...);

  kernel1<<<grid, block, 0, stream2>>>(...);
  kernel2<<<grid, block, 0, stream2>>>(...);
  cudaEventRecord(trigger2, stream2);
  kernel3<<<grid, block, 0, stream2>>>(...);
  kernel4<<<grid, block, 0, stream2>>>(...);

  cudaEventSynchronize(trigger1);
  // do something with value1

  cudaEventSynchronize(trigger2);
  // do something with value2
}

Thanks for the suggestion, but kernel1 depends on kernel4's output, so splitting the work across two independent streams is not possible.
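For a pattern like this, where the iterations cannot be overlapped but blocking the host is still a concern, one option worth considering is a stream-ordered host callback via cudaLaunchHostFunc (available since CUDA 10). A minimal sketch, where `consume` and `helperStream` are illustrative names and `val` is the pinned buffer from earlier:

```cuda
// Runs on a host thread once all prior work in its stream has completed.
// Note: CUDA API calls must not be made from inside the callback.
void CUDART_CB consume(void *userData)
{
    float v = *(float *)userData; // pinned buffer filled by the preceding async copy
    // ... use v ...
}

// In the loop: record an event after the async copy on the main stream,
// make a helper stream wait on it, and run the callback there, so that
// kernel3/kernel4 on the main stream are not held up behind host work.
cudaEventRecord(event, stream);
cudaStreamWaitEvent(helperStream, event, 0);
cudaLaunchHostFunc(helperStream, consume, val);
```

This keeps the main stream free to launch kernel3/kernel4 immediately, at the cost of the value being consumed asynchronously rather than inline in the loop body.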