Hi guys,
I am struggling with streams and memcpys. I have 4 kernels running in a loop. The second computes a single value that I need to copy to the CPU while the remaining kernels are running. I am using the Visual Profiler to check the timeline of the kernels.
I have tried concurrent streams with events, and async memcpy with pinned memory and a cudaDeviceSynchronize at the end of the loop. Unfortunately, all variants cause too much delay.
Does anyone have a good suggestion?
Here is the pseudocode:
while(...)
{
kernel1<<<...>>>(...);
kernel2<<<...>>>(...);
    cudaMemcpy(&h_val, d_val, sizeof(float), cudaMemcpyDeviceToHost);
kernel3<<<...>>>(...);
kernel4<<<...>>>(...);
// do something with h_val
}
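One pattern worth profiling (a sketch, not your actual code — the stream/event names are mine, and it assumes kernel3 and kernel4 do not read h_val): issue the copy on a dedicated stream with cudaMemcpyAsync into pinned memory, gate it on an event recorded after kernel2, and make the host wait only on the copy's own event instead of the whole device:

```
// Assumes h_val points to pinned memory (cudaHostAlloc) and
// d_val is a device pointer; kernel3/kernel4 must not need h_val.
cudaStream_t compute, copy;
cudaStreamCreate(&compute);
cudaStreamCreate(&copy);
cudaEvent_t k2done, copydone;
cudaEventCreateWithFlags(&k2done, cudaEventDisableTiming);
cudaEventCreateWithFlags(&copydone, cudaEventDisableTiming);

while (...)
{
    kernel1<<<..., 0, compute>>>(...);
    kernel2<<<..., 0, compute>>>(...);
    cudaEventRecord(k2done, compute);

    // The copy stream waits only for kernel2, then transfers the
    // value while kernel3/kernel4 keep the compute stream busy.
    cudaStreamWaitEvent(copy, k2done, 0);
    cudaMemcpyAsync(h_val, d_val, sizeof(float),
                    cudaMemcpyDeviceToHost, copy);
    cudaEventRecord(copydone, copy);

    kernel3<<<..., 0, compute>>>(...);
    kernel4<<<..., 0, compute>>>(...);

    // Block the host only until the 4-byte copy has landed,
    // not until the whole device is idle.
    cudaEventSynchronize(copydone);
    // do something with *h_val
}
```

Whether this actually hides the copy depends on whether kernel3/kernel4 can run concurrently with the transfer on your device, which the Visual Profiler timeline should show directly.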
I would think that your kernel grid/block dimensions and kernel dependencies (kernels requiring other/previous kernels' output data, or data from the host) would largely determine whether you can still improve on this.
It is still difficult to gauge whether the GPU is fully utilized - one of the key reasons to use streams in the first place.
Depending on whether the GPU is already fully utilized or not, and on the dependencies between loop iterations - what you are looping on - you may for example also have:
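For instance, something like the following sketch (my own illustration, assuming kernel3/kernel4 do not depend on the copied value): keep the copy and the independent kernels on different streams so the transfer overlaps them, and synchronize only the copy's stream before the host uses h_val:

```
// Assumes h_val is pinned and kernel3/kernel4 are independent
// of the value kernel2 produces.
while (...)
{
    kernel1<<<..., 0, stream0>>>(...);
    kernel2<<<..., 0, stream0>>>(...);
    cudaMemcpyAsync(&h_val, d_val, sizeof(float),
                    cudaMemcpyDeviceToHost, stream0);
    kernel3<<<..., 0, stream1>>>(...);  // overlaps the copy
    kernel4<<<..., 0, stream1>>>(...);
    cudaStreamSynchronize(stream0);     // wait for the copy only
    // do something with h_val
}
```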