Timeline - Possibly inaccurate depiction of cudaMemcpy


I created a small program that creates a single context but three streams (in addition to the default).

  • Each thread uses one stream and, in a loop, calls cudaMemcpyAsync.
  • Every copy is device-to-device.
  • Each copy is 512 MB.

I used NSIGHT to capture this because I wanted to characterize how the dual copy engines work.

  • NSIGHT is showing me that the P4000 is capable of 3 simultaneous copies.

Can someone help me understand the timeline?

Thank you.