For overlap between kernels and host->device copies, only two non-default streams should be necessary. Optimal overlap requires that copy time is close to kernel execution time, which may not be the case here (I don't have time to study the diagrams). Without overlap, what's the pure copy time, and what's the pure kernel execution time? With optimal overlap, your stream-based version should have an execution time equal to max(copy_time, kernel_execution_time), whereas the non-stream variant takes (copy_time + kernel_execution_time).
Keep in mind that large FFTs are limited by memory throughput. The GT 710 specification states a theoretical bandwidth of 14.4 GB/sec, which means the maximum achievable bandwidth is likely around 11.5 GB/sec. This GPU has a PCIe gen2 interface, capable of delivering data at about 6 GB/sec for large copies. Your observation may also be partially related to contention for the GPU's memory, i.e. a concurrent copy slows down kernel execution compared to the case where copy and kernel execution happen sequentially.
When I ran the bandwidthTest sample from the NVIDIA samples, I got the results below. They differ from the figures you mentioned, but they are consistent with the nvvp output.
Device 0: GeForce GT 710
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 753.4
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 823.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12015.6
Result = PASS
Anyway, I still have a question. Do you mean that if concurrency improves, there would be an adverse side effect, such as slower kernel execution on the single SMX? And would the overall performance therefore end up about the same?
I agree that performance is best when the copy time and the kernel execution time are equal. But at the moment, what I most want to know is whether the concurrency between data copies and kernel execution (cuFFT in this case) can be improved on the GT 710. There is almost no overlap between the data copies and cuFFT, even though I used different non-default streams for the operations.