Less Asynchronous Data Transfer/Kernel Overlap on a K40 than a GTX 770

I’ve been developing some CUDA code on an NVIDIA GTX 770. To overlap data transfers with kernel execution, I implemented a finite state machine that schedules these operations across four streams. Using this strategy, it was possible to completely overlap the two. See http://imgur.com/DmGm5tG.

I then tried running the same code on a K40, and there seems to be far less overlap on this device. See http://imgur.com/hJKn0js. It almost appears as if the driver “waits” a bit before scheduling the data transfers. This surprised me, given the more advanced scheduling features and extra copy engine on the GK110 architecture.

I haven’t posted any code because it’s not simple to provide a minimal example. The essential pattern is

  1. transfer data onto the device
  2. execute kernels
  3. transfer a single floating-point value off the device
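This is not the real code, but the three steps above can be sketched roughly as follows. Everything named here (`reduceChunk`, `runPipeline`, the chunk sizes, the simple round-robin issue order) is made up for illustration; the actual scheduling is done by the finite state machine, which this sketch does not reproduce.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical stand-in for the real kernels: sums one chunk into one float.
__global__ void reduceChunk(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}

// Issues nChunks chunks round-robin over nStreams streams and returns the
// sum of the per-chunk results. The input is filled with 1.0f.
float runPipeline(int nStreams, int nChunks, int chunkElems)
{
    const size_t chunkBytes = (size_t)chunkElems * sizeof(float);
    const size_t totalElems = (size_t)nChunks * chunkElems;

    // Pinned host buffers, so the async copies can overlap with kernels.
    float *hIn = nullptr, *hOut = nullptr;
    CHECK(cudaMallocHost((void**)&hIn, totalElems * sizeof(float)));
    CHECK(cudaMallocHost((void**)&hOut, nChunks * sizeof(float)));
    for (size_t i = 0; i < totalElems; ++i) hIn[i] = 1.0f;

    // One stream plus one input/output device buffer per stream.
    std::vector<cudaStream_t> streams(nStreams);
    std::vector<float*> dIn(nStreams), dOut(nStreams);
    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc((void**)&dIn[s], chunkBytes));
        CHECK(cudaMalloc((void**)&dOut[s], sizeof(float)));
    }

    const int threads = 256;
    const int blocks = (chunkElems + threads - 1) / threads;
    for (int c = 0; c < nChunks; ++c) {
        int s = c % nStreams;  // buffer reuse is safe: a stream runs in order
        // 1. transfer data onto the device
        CHECK(cudaMemcpyAsync(dIn[s], hIn + (size_t)c * chunkElems, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        CHECK(cudaMemsetAsync(dOut[s], 0, sizeof(float), streams[s]));
        // 2. execute kernels
        reduceChunk<<<blocks, threads, 0, streams[s]>>>(dIn[s], dOut[s],
                                                        chunkElems);
        CHECK(cudaGetLastError());
        // 3. transfer a single floating-point value off the device
        CHECK(cudaMemcpyAsync(&hOut[c], dOut[s], sizeof(float),
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());

    float total = 0.0f;
    for (int c = 0; c < nChunks; ++c) total += hOut[c];

    for (int s = 0; s < nStreams; ++s) {
        CHECK(cudaFree(dIn[s]));
        CHECK(cudaFree(dOut[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return total;
}
```

Calling `runPipeline(4, 8, 1 << 16)` under a profiler should give a timeline with the same general shape as the screenshots, although the amount of overlap will depend on kernel duration and transfer size.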

I was wondering whether there are any configuration options I need to set on the K40 to make it behave like the GTX 770 (set up Hyper-Q, etc.)?
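One thing that can be compared directly on the two devices is what the runtime itself reports about copy engines and concurrent kernel support. A minimal sketch, assuming the device of interest is device 0 and using a made-up helper name (`reportOverlapSupport`):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the overlap-related properties of a device and
// returns its number of asynchronous (copy) engines, or -1 on error.
int reportOverlapSupport(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("  asyncEngineCount:  %d\n", prop.asyncEngineCount);
    printf("  concurrentKernels: %d\n", prop.concurrentKernels);
    return prop.asyncEngineCount;
}
```

Calling `reportOverlapSupport(0)` on each machine would at least confirm whether the K40 is exposing more copy engines than the GTX 770.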

thanks

are the devices in the same host, or not?

No, they aren’t. In fact, the K40 is on an HPC machine (32 core E5-2690 0 @ 2.90GHz, 512GB RAM) that is shared with other users. Would it be reasonable to assume that the CPU memory bus could be saturated by other processes? I do have exclusive access to the K40, however.

as per njuffa:

“Are you using numactl to control CPU and memory affinity such that each GPU always communicates with the “near” CPU and memory?”

https://devtalk.nvidia.com/default/topic/828002/?comment=4518117

I am not; thanks for highlighting this. However, I don’t think memory throughput is the issue – the bandwidth on the transfers above is 9.9 GB/s.

i presume you are using pinned memory for the device-to-host transfers; are you using pinned memory for the host-to-device transfers too?

do you have any statistics on the load on the host, and on host memory utilization, given the multiple users you mentioned?
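on the pinned memory question: asynchronous copies generally need page-locked host memory on both sides to overlap with kernels. a rough sketch of pinning in both directions, assuming the host-to-device source is an ordinary malloc'd buffer that gets registered after the fact (the helper name `pinnedRoundTrip` is made up):

```cuda
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: pins an existing malloc'd buffer, copies it to the
// device and back into a second pinned buffer, and checks the round trip.
bool pinnedRoundTrip(size_t n)
{
    const size_t bytes = n * sizeof(float);
    float* src = (float*)malloc(bytes);
    if (!src) return false;
    for (size_t i = 0; i < n; ++i) src[i] = (float)i;

    float* dst = nullptr;
    float* dev = nullptr;
    cudaStream_t stream = nullptr;

    // Host-to-device source: page-lock the existing allocation in place.
    cudaError_t err = cudaHostRegister(src, bytes, cudaHostRegisterDefault);
    const bool registered = (err == cudaSuccess);
    // Device-to-host destination: allocate page-locked memory directly.
    if (err == cudaSuccess) err = cudaMallocHost((void**)&dst, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&dev, bytes);
    if (err == cudaSuccess) err = cudaStreamCreate(&stream);
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(dev, src, bytes, cudaMemcpyHostToDevice, stream);
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(dst, dev, bytes, cudaMemcpyDeviceToHost, stream);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    const bool ok = (err == cudaSuccess) && memcmp(src, dst, bytes) == 0;

    // Best-effort cleanup; return codes are ignored here for brevity.
    if (stream) cudaStreamDestroy(stream);
    if (dev) cudaFree(dev);
    if (dst) cudaFreeHost(dst);
    if (registered) cudaHostUnregister(src);
    free(src);
    return ok;
}
```

if either side of your transfers is plain pageable memory, that alone could change how the copies get scheduled, so it is worth ruling out.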