Concurrency for cudaMemcpy and cuFFT with a single SMX

Hi there,

I am working on signal processing on a GT 710, which has 1 SMX (192 cores) and 2 GB of memory.

When I tried to use multiple threads with different streams, the nvvp results indicated that concurrency was very low.

In fact, the kernels did almost nothing while data was being copied from host to device, even though I created 2 or 4 streams.

The FFT input data size is around 1 MB.

I am wondering whether this is caused by a limitation of the graphics device or by an incorrectly written program.


For overlap between kernels and host-to-device copies, only two non-default streams should be necessary. Optimal overlap requires that the copy time is close to the kernel execution time, which may not be the case here (I don't have time to study the diagrams). Without overlap, what's the pure copy time, and what's the pure kernel execution time? With optimal overlap, your stream-based version should have an execution time equal to max(copy_time, kernel_execution_time), whereas for the non-stream variant it is (copy_time + kernel_execution_time).
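
For illustration, a minimal sketch of this pattern might look as follows (assuming pinned host buffers and a 1D complex-to-complex FFT; FFT_SIZE and NUM_BATCHES are placeholder values, not taken from your program):

// Minimal sketch of the two-stream copy/compute overlap pattern described
// above, assuming pinned host buffers and a 1D C2C FFT. FFT_SIZE and
// NUM_BATCHES are illustrative placeholders, not from the actual program.
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

int main()
{
    const int FFT_SIZE    = 1 << 17;                          // 131072 points
    const int NUM_BATCHES = 8;                                // cycled over 2 streams
    const size_t bytes    = FFT_SIZE * sizeof(cufftComplex);  // ~1 MB per batch

    cudaStream_t stream[2];
    cufftHandle  plan[2];
    cufftComplex *h_buf[2], *d_buf[2];

    for (int i = 0; i < 2; i++) {
        CHECK(cudaStreamCreate(&stream[i]));
        // Pinned host memory is required for cudaMemcpyAsync to overlap
        // with kernel execution at all.
        CHECK(cudaMallocHost(&h_buf[i], bytes));
        CHECK(cudaMalloc(&d_buf[i], bytes));
        cufftPlan1d(&plan[i], FFT_SIZE, CUFFT_C2C, 1);
        cufftSetStream(plan[i], stream[i]);   // issue the FFT into this stream
    }

    for (int b = 0; b < NUM_BATCHES; b++) {
        int s = b % 2;  // alternate streams so copy(b+1) can overlap fft(b)
        // ... fill h_buf[s] with the next batch of signal data here;
        // a real application must synchronize the stream before reusing it.
        CHECK(cudaMemcpyAsync(d_buf[s], h_buf[s], bytes,
                              cudaMemcpyHostToDevice, stream[s]));
        cufftExecC2C(plan[s], d_buf[s], d_buf[s], CUFFT_FORWARD);
        CHECK(cudaMemcpyAsync(h_buf[s], d_buf[s], bytes,
                              cudaMemcpyDeviceToHost, stream[s]));
    }
    CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < 2; i++) {
        cufftDestroy(plan[i]);
        CHECK(cudaFree(d_buf[i]));
        CHECK(cudaFreeHost(h_buf[i]));
        CHECK(cudaStreamDestroy(stream[i]));
    }
    return 0;
}

Note that on a device with a single copy engine (asyncEngineCount == 1, which I would expect on a GT 710), host-to-device and device-to-host transfers serialize with each other even when issued into different streams, so only one copy direction can overlap the kernel at a time.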

Keep in mind that large FFTs are limited by memory throughput. The GT 710 specification states a theoretical bandwidth of 14.4 GB/sec, which means the maximum achievable bandwidth is likely around 11.5 GB/sec. This GPU has a PCIe gen2 interface, capable of delivering data at about 6 GB/sec for large copies. Your observation may also be partially related to conflicting use of the GPU's memory, i.e. a concurrent copy slows down kernel execution compared to the case where the copy and the kernel execution happen sequentially.
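
To put rough numbers on the 1 MB case (a back-of-envelope estimate, not a measurement): the host-to-device copy at ~6 GB/sec would take about 1 MB / 6 GB/sec ≈ 0.17 ms, while a memory-bound FFT kernel moving a few MB of data at ~11.5 GB/sec would likewise take a few tenths of a millisecond, so even perfect overlap would at best roughly halve the per-batch time.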

Thank you, njuffa, for the answer.

When I ran the bandwidthTest sample from the NVIDIA samples, the results were as follows. They are different from the numbers you mentioned, but they are consistent with the nvvp output.

Device 0: GeForce GT 710
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 753.4

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 823.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12015.6

Result = PASS

Anyway, I still have a question. Do you mean that even if concurrency is improved, there would be another adverse effect, such as slowed-down kernel execution on the single SMX? Consequently, would the overall performance end up similar?

I agree that it will show maximum performance if the copy time and the kernel execution time are the same. But at the moment, my main question is whether this concurrency between data copies and kernel execution (cuFFT in this case) can be improved at all while I use the GT 710. There is almost no overlap between the data copies and the cuFFT kernels, even though I used different non-default streams for the operations.
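
For reference, this is roughly how I would measure the pure copy time and pure kernel execution time you asked about (a minimal sketch; FFT_SIZE is a placeholder for my actual transform size):

// Minimal sketch for measuring pure copy time vs. pure cuFFT time with
// CUDA events, without any overlap; FFT_SIZE is an illustrative placeholder.
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int FFT_SIZE = 1 << 17;                          // placeholder size
    const size_t bytes = FFT_SIZE * sizeof(cufftComplex);  // ~1 MB

    cufftComplex *h, *d;
    cudaMallocHost(&h, bytes);   // pinned host buffer
    cudaMalloc(&d, bytes);

    cufftHandle plan;
    cufftPlan1d(&plan, FFT_SIZE, CUFFT_C2C, 1);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // pure copy, no overlap
    cudaEventRecord(t1);
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);           // pure kernel, no overlap
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float copy_ms = 0.0f, fft_ms = 0.0f;
    cudaEventElapsedTime(&copy_ms, t0, t1);
    cudaEventElapsedTime(&fft_ms, t1, t2);
    printf("pure copy: %.3f ms, pure FFT: %.3f ms\n", copy_ms, fft_ms);

    cufftDestroy(plan);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}

If the measured copy time turns out to be much larger than the FFT time (which the ~750 MB/sec host-to-device figure above suggests), then even perfect overlap could only hide the shorter of the two.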