Hi, I’m trying to use concurrency in my code, but I ran into something I don’t understand about cudaStreamSynchronize.
Here is the pseudocode:
const int stream_num = 2;
cudaStream_t stream[stream_num];
for (int i = 0; i < stream_num; ++i)
    cudaStreamCreate(&stream[i]);

int stream_idx = 0;
while (1) { // process loop
    int prev_idx = (stream_idx + stream_num - 1) % stream_num;
    // current stream: H2D copy, then kernel
    cudaMemcpyAsync(D, H, size, cudaMemcpyHostToDevice, stream[stream_idx]); // H2D
    kernel<<<dim_grid, dim_block, 0, stream[stream_idx]>>>();
    // previous stream: D2H copy, then block the host until it finishes
    cudaMemcpyAsync(H, D, size, cudaMemcpyDeviceToHost, stream[prev_idx]); // D2H
    cudaStreamSynchronize(stream[prev_idx]);
    stream_idx = (stream_idx + 1) % stream_num;
}

for (int i = 0; i < stream_num; ++i)
    cudaStreamDestroy(stream[i]);
The idea is to use 2 streams. On each pass through the processing loop, I issue the H2D memcpy and the kernel launch on the current stream, then issue the D2H memcpy on the previous stream and call cudaStreamSynchronize on that previous stream.
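To make the pseudocode above concrete, here is a minimal self-contained sketch of the same loop. It is not my real code: scale_kernel, the element count n, the fixed iterations count, the CHECK macro, and the per-stream H[i]/D[i] buffers are all placeholders made up for illustration. It also assumes pinned host memory (cudaMallocHost), since async copies from pageable host memory are not guaranteed to overlap with kernels, and it skips the D2H copy on the first pass because no previous result exists yet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: scales each element in place.
__global__ void scale_kernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Report any CUDA runtime error and bail out of main().
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            return 1;                                             \
        }                                                         \
    } while (0)

int main() {
    const int stream_num = 2;
    const int n = 1 << 20;      // assumed element count
    const int iterations = 8;   // assumed; the real loop runs indefinitely
    const size_t size = n * sizeof(float);

    cudaStream_t stream[stream_num];
    float *H[stream_num], *D[stream_num];
    for (int i = 0; i < stream_num; ++i) {
        CHECK(cudaStreamCreate(&stream[i]));
        CHECK(cudaMallocHost(&H[i], size));  // pinned host buffer
        CHECK(cudaMalloc(&D[i], size));      // device buffer for this stream
        for (int j = 0; j < n; ++j) H[i][j] = 1.0f;
    }

    const dim3 dim_block(256);
    const dim3 dim_grid((n + dim_block.x - 1) / dim_block.x);

    int stream_idx = 0;
    for (int it = 0; it < iterations; ++it) {
        int prev_idx = (stream_idx + stream_num - 1) % stream_num;

        // Current stream: upload, then compute.
        CHECK(cudaMemcpyAsync(D[stream_idx], H[stream_idx], size,
                              cudaMemcpyHostToDevice, stream[stream_idx]));
        scale_kernel<<<dim_grid, dim_block, 0, stream[stream_idx]>>>(
            D[stream_idx], n, 2.0f);
        CHECK(cudaGetLastError());

        // Previous stream: download the earlier result and wait for it.
        // Skipped on the first pass, when nothing has been computed yet.
        if (it > 0) {
            CHECK(cudaMemcpyAsync(H[prev_idx], D[prev_idx], size,
                                  cudaMemcpyDeviceToHost, stream[prev_idx]));
            CHECK(cudaStreamSynchronize(stream[prev_idx]));
        }
        stream_idx = (stream_idx + 1) % stream_num;
    }

    // Drain the result of the last pass, which the loop never downloaded.
    int last_idx = (stream_idx + stream_num - 1) % stream_num;
    CHECK(cudaMemcpyAsync(H[last_idx], D[last_idx], size,
                          cudaMemcpyDeviceToHost, stream[last_idx]));
    CHECK(cudaStreamSynchronize(stream[last_idx]));

    for (int i = 0; i < stream_num; ++i) {
        CHECK(cudaFree(D[i]));
        CHECK(cudaFreeHost(H[i]));
        CHECK(cudaStreamDestroy(stream[i]));
    }
    printf("done\n");
    return 0;
}
```

This should build with something like nvcc -o stream_sync stream_sync.cu (the file name is hypothetical). Commenting out the cudaStreamSynchronize call inside the loop gives the variant without the sync, so the two timelines can be compared in the profiler.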
In my case, the kernel takes longer than the H2D and D2H memcpys combined, so I expected the performance to be the same with or without cudaStreamSynchronize (see attached nno-stream-sync.png).
In fact, cudaStreamSynchronize hurts performance by delaying the start of some kernels, for a reason I don’t understand.
The Visual Profiler screenshot is attached as stream-sync.png.
Why doesn’t kernel-3 start right after kernel-2 ends? It seems to be waiting for the sync on the other stream (stream-13). Why?