stream synchronize problem

Hi, I’m trying to overlap memory copies and kernel execution using CUDA streams, but I found some behavior with cudaStreamSynchronize that I don’t understand.

The pseudocode:

const int stream_num = 2;
cudaStream_t stream[stream_num];
for (int i = 0; i < stream_num; ++i)
	cudaStreamCreate(&stream[i]);

int stream_idx = 0;
while (1) { // process loop
	// index of the stream that was "current" in the previous iteration
	int prev_idx = (stream_idx + stream_num - 1) % stream_num;
	cudaMemcpyAsync(D, H, size, cudaMemcpyHostToDevice, stream[stream_idx]); // H2D
	kernel<<<dim_grid, dim_block, 0, stream[stream_idx]>>>();
	cudaMemcpyAsync(H, D, size, cudaMemcpyDeviceToHost, stream[prev_idx]); // D2H
	cudaStreamSynchronize(stream[prev_idx]); // wait for the previous stream only
	stream_idx = (stream_idx + 1) % stream_num;
}

for (int i = 0; i < stream_num; ++i)
	cudaStreamDestroy(stream[i]);

The idea is to use 2 streams. In each iteration of the processing loop, I issue the H2D memcpy and the kernel launch to the current stream, then issue the D2H memcpy to the previous stream and call cudaStreamSynchronize on the previous stream.
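
For anyone who wants to reproduce the pattern, here is a minimal sketch of the loop described above. It is not my real code: `double_kernel`, `run_pipeline`, the buffer names, the sizes, and the `sync_prev_each_iter` switch are all made up for illustration. It also assumes pinned host memory (cudaMallocHost) and one set of buffers per stream, and it skips the D2H copy in the very first iteration because the previous stream has not produced anything yet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the failing call and makes the enclosing function return -1.
// Resources are not freed on that path, which is acceptable for a sketch.
#define CUDA_TRY(call)                                              \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "%s failed: %s\n", #call,          \
                         cudaGetErrorString(err_));                 \
            return -1;                                              \
        }                                                           \
    } while (0)

// Hypothetical stand-in for the real kernel: doubles each element in place.
__global__ void double_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs the two-stream loop for `iters` iterations.
// Returns the number of wrong output elements (0 on success), -1 on a CUDA error.
int run_pipeline(int iters, bool sync_prev_each_iter) {
    const int kStreams = 2;               // same stream count as in the post
    const int n = 1 << 20;                // assumed element count
    const size_t bytes = n * sizeof(float);

    cudaStream_t stream[kStreams];
    float *h_in[kStreams], *h_out[kStreams], *d_buf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        CUDA_TRY(cudaStreamCreate(&stream[s]));
        CUDA_TRY(cudaMallocHost(&h_in[s], bytes));   // pinned host memory
        CUDA_TRY(cudaMallocHost(&h_out[s], bytes));
        CUDA_TRY(cudaMalloc(&d_buf[s], bytes));
        for (int j = 0; j < n; ++j) { h_in[s][j] = 1.0f; h_out[s][j] = 0.0f; }
    }

    int cur = 0;
    for (int it = 0; it < iters; ++it) {
        int prev = (cur + kStreams - 1) % kStreams;
        // Current stream: H2D copy followed by the kernel.
        CUDA_TRY(cudaMemcpyAsync(d_buf[cur], h_in[cur], bytes,
                                 cudaMemcpyHostToDevice, stream[cur]));
        double_kernel<<<(n + 255) / 256, 256, 0, stream[cur]>>>(d_buf[cur], n);
        CUDA_TRY(cudaGetLastError());
        // Previous stream: D2H copy, optionally followed by a stream sync.
        if (it > 0) {
            CUDA_TRY(cudaMemcpyAsync(h_out[prev], d_buf[prev], bytes,
                                     cudaMemcpyDeviceToHost, stream[prev]));
            if (sync_prev_each_iter) CUDA_TRY(cudaStreamSynchronize(stream[prev]));
        }
        cur = (cur + 1) % kStreams;
    }

    // Drain: fetch the result of the last kernel, then wait for everything.
    if (iters > 0) {
        int last = (cur + kStreams - 1) % kStreams;
        CUDA_TRY(cudaMemcpyAsync(h_out[last], d_buf[last], bytes,
                                 cudaMemcpyDeviceToHost, stream[last]));
    }
    CUDA_TRY(cudaDeviceSynchronize());

    // Only streams that actually ran a kernel have valid output.
    int used = iters < kStreams ? iters : kStreams;
    int wrong = 0;
    for (int s = 0; s < used; ++s)
        for (int j = 0; j < n; ++j)
            if (h_out[s][j] != 2.0f) ++wrong;

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(d_buf[s]);
        cudaFreeHost(h_in[s]);
        cudaFreeHost(h_out[s]);
        cudaStreamDestroy(stream[s]);
    }
    return wrong;
}
```

Calling `run_pipeline(8, true)` and `run_pipeline(8, false)` from a `main` and profiling both runs should give timelines comparable to the two cases discussed here, though the kernel in this sketch is much shorter than a realistic one.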

In my case, the kernel takes longer than the H2D and D2H memcpys combined, so I expected the performance to be the same with or without cudaStreamSynchronize (see the attached no-stream-sync.png).

In fact, cudaStreamSynchronize hurts performance by delaying the start of some kernels, for a reason I don’t understand.

The visual profiling screenshot is attached as stream-sync.png

Why doesn’t kernel-3 start right after kernel-2 ends? It seems to be waiting for the synchronize on the other stream (stream-13). Why?

no-stream-sync.png

Oh, I have the same problem. Can anyone solve it?

Here is my system info:

Device 0: “GeForce GTX 960M”
SM Capability 5.0 detected:
[GeForce GTX 960M] has 5 MP(s) x 128 (Cores/MP) = 640 (Cores)
Compute performance scaling factor = 1.00
max threads per block: 1024, max dim: [1024 1024 64]
shared memory per block: 49152
is_integrated: 0
pageableMemoryAccess: 0
concurrentManagedAccess: 0
canMapHostMemory: 1
asyncEngineCount: 1
deviceOverlap: 1
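
In case it helps others compare setups, here is a rough sketch of how a few of the fields above can be read with cudaDeviceGetAttribute. The function name `query_async_engine_count` is made up for illustration. As far as I understand the CUDA documentation, an asyncEngineCount of 1 means the device can overlap a copy with kernel execution, while 2 means it can also copy in both directions at the same time; whether that explains the delay in this thread is only a guess.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints a few copy/overlap-related attributes of `device`.
// Returns asyncEngineCount, or -1 if any query fails.
int query_async_engine_count(int device) {
    int engines = 0, integrated = 0, can_map = 0;
    if (cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&integrated, cudaDevAttrIntegrated, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&can_map, cudaDevAttrCanMapHostMemory, device) != cudaSuccess) {
        std::fprintf(stderr, "attribute query failed for device %d\n", device);
        return -1;
    }
    std::printf("is_integrated: %d\n", integrated);
    std::printf("canMapHostMemory: %d\n", can_map);
    std::printf("asyncEngineCount: %d\n", engines);
    return engines;
}
```

Calling `query_async_engine_count(0)` prints the values for the first device in the same `name: value` layout as the listing above.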