I’ve been testing some code on a GPU and found a surprising performance result when processing data in two CUDA streams on a V100. In every example I’ve seen where two streams are used to overlap compute and copy operations, each stream processes independent data, and my code was originally structured that way. As an experiment, I rearranged it so there is a dedicated compute stream and a dedicated copy stream. When I did this, my code ran twice as fast.
I have no idea why this would happen. My problem has the typical copy host-to-device, compute, copy device-to-host pattern, but the first copy takes a negligible amount of time and the device-to-host copy takes about as long as the compute operation.
So, my first arrangement, in pseudocode was:
    for i = 1 to num_iterations
        stream = streams[i % num_streams]   // num_streams = 2
        copy input data from host to device in stream
        compute in stream
        copy output data from device to host in stream
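In CUDA terms, the first arrangement would look roughly like the sketch below. Everything here is a hypothetical stand-in for the real workload: `process`, the buffer names, sizes, and launch configuration are not from my actual code.

    // Sketch of the round-robin arrangement (assumed names throughout).
    const int num_streams = 2;
    cudaStream_t streams[num_streams];
    for (int s = 0; s < num_streams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int i = 0; i < num_iterations; ++i) {
        cudaStream_t stream = streams[i % num_streams];
        // Host buffers must be pinned (cudaMallocHost) or these async
        // copies will not actually overlap with compute.
        cudaMemcpyAsync(d_in, h_in[i], in_bytes, cudaMemcpyHostToDevice, stream);
        process<<<grid, block, 0, stream>>>(d_in, d_out);
        cudaMemcpyAsync(h_out[i], d_out, out_bytes, cudaMemcpyDeviceToHost, stream);
    }
    cudaDeviceSynchronize();

Note that real code would need a separate `d_in`/`d_out` pair per stream, since two in-flight iterations would otherwise race on the same device buffers.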
Each iteration alternates streams. I did not batch like operations together (all copies first, then all kernel launches) as some blogs suggest, because I’ve read that isn’t necessary on newer GPUs.
My second arrangement was:
    for i = 1 to num_iterations
        copy input data from host to device in compute stream   // takes almost no time
        compute in compute stream
        synchronize to prevent the copy stream from starting too soon
        copy output data from device to host in copy stream
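For concreteness, here is roughly how I do the cross-stream synchronization, using an event so the copy stream waits on the compute stream without blocking the host. Again, `process` and the buffer names are placeholders, not my actual code:

    // Sketch of the dedicated-stream arrangement with event-based ordering.
    cudaStream_t compute_stream, copy_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&copy_stream);
    cudaEvent_t compute_done;
    cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming);

    for (int i = 0; i < num_iterations; ++i) {
        cudaMemcpyAsync(d_in, h_in[i], in_bytes, cudaMemcpyHostToDevice, compute_stream);
        process<<<grid, block, 0, compute_stream>>>(d_in, d_out);
        cudaEventRecord(compute_done, compute_stream);
        // The copy stream waits on the event, so the D2H copy cannot start
        // before the compute finishes, but the host thread is free to
        // enqueue the next iteration immediately.
        cudaStreamWaitEvent(copy_stream, compute_done, 0);
        cudaMemcpyAsync(h_out[i], d_out, out_bytes, cudaMemcpyDeviceToHost, copy_stream);
    }
    cudaDeviceSynchronize();

In real code `d_out` would need double buffering (or the compute stream would also have to wait on the copy stream), since the next iteration’s kernel could otherwise overwrite `d_out` while the copy is still reading it.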
Does anyone have good intuition for why the second arrangement would be twice as fast? Is this indicative of a bug or error I haven’t discovered yet?
Basically, I am suspicious because it worked so well, and I’ve never seen streams organized this way in any examples.