I wrote a small program that creates a single context and three streams (in addition to the default stream).
- Each thread uses one stream and, in a loop, calls cudaMemcpyAsync.
- Each copy is 512 MB.
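
For reference, here is a minimal sketch of the kind of test program described above, not the exact code. It assumes one host thread per stream, host-to-device copies, pinned host buffers from cudaMallocHost, device 0, and 10 iterations per thread; none of those details are stated in the post. The names copy_worker, run_copies and CHECK are hypothetical helpers. It also prints asyncEngineCount, which is the number of copy engines the runtime reports for the device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical worker: one host thread that owns one stream and enqueues
// `iterations` asynchronous copies of `bytes` bytes on it.
void copy_worker(size_t bytes, int iterations, int *copies_issued) {
    CHECK(cudaSetDevice(0));  // every thread shares device 0's primary context

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    void *host = nullptr;
    void *dev = nullptr;
    CHECK(cudaMallocHost(&host, bytes));  // assumption: pinned host memory
    CHECK(cudaMalloc(&dev, bytes));

    for (int i = 0; i < iterations; ++i) {
        // Assumption: host-to-device direction.
        CHECK(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream));
        ++*copies_issued;  // counts enqueued copies, not completed ones
    }
    CHECK(cudaStreamSynchronize(stream));  // wait for this stream to drain

    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    CHECK(cudaStreamDestroy(stream));
}

// Launch `num_threads` workers and return the total number of copies issued.
int run_copies(int num_threads, size_t bytes, int iterations) {
    std::vector<int> counts(num_threads, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(copy_worker, bytes, iterations, &counts[t]);

    int total = 0;
    for (int t = 0; t < num_threads; ++t) {
        threads[t].join();
        total += counts[t];
    }
    return total;
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);

    const size_t bytes = 512ull << 20;        // 512 MB per copy
    int total = run_copies(3, bytes, 10);     // three threads, three streams
    std::printf("issued %d copies\n", total);
    return 0;
}
```

If the real program uses pageable host memory or a different copy direction, the timeline may look different from what this sketch produces.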
I used NSIGHT to capture this, as I wanted to characterize how the dual-copy engine worked.
- NSIGHT is showing me that the P4000 is capable of 3 simultaneous copies.
Can someone help me understand the timeline?