How to Use Dual Copy Engines on GTX 1070?

Hello,

I tested the dual copy engines using cudaMemcpyAsync() on a GTX 1070 with CUDA 8.0,
but only one copy engine was used. When I ran deviceQuery.exe from the CUDA Samples, it confirmed that the GTX 1070 has dual copy engines.

This is my test code:

for(int i=0; i<iNumStream; i++)
{
    int iOffset = i * iStreamSize;
    cudaMemcpyAsync(&device_R[iOffset], &host_R[iOffset], iStreamSize, cudaMemcpyHostToDevice, streams[i]);
    Kernel<<<dimGrid, dimBlock, 0, streams[i]>>>(device_R, device_Out, iOffset);
    cudaMemcpyAsync(&host_Out[iOffset], &device_Out[iOffset], iStreamSize, cudaMemcpyDeviceToHost, streams[i]);
}

Please help me!

Copy operations issued to the same stream will not overlap, because work within a single stream executes in issue order.

The issue you are running into is not whether you can get dual copy engines to work, but the structure of your code.

As a simple test, do something like this:

cudaMemcpyAsync(&device_R[iOffset], &host_R[iOffset], iStreamSize, cudaMemcpyHostToDevice, streams[0]);
cudaMemcpyAsync(&host_Out[iOffset], &device_Out[iOffset], iStreamSize, cudaMemcpyDeviceToHost, streams[1]);

If you have dual copy engines, those operations should overlap.
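
For a fuller version of that test, here is a minimal standalone sketch. It is not your program: the buffer names, the 256 MiB size, and the timeCopies() helper are made up for illustration. It assumes the host buffers are pinned with cudaMallocHost(), since cudaMemcpyAsync() on pageable memory is generally not fully asynchronous. It times one host-to-device and one device-to-host copy, first in the same stream and then in two different streams. Build with something like nvcc -std=c++11.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,           \
                    cudaGetErrorString(err_));                           \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Hypothetical helper: issue one H2D copy in stream `up` and one D2H copy
// in stream `down`, then return the wall-clock time in milliseconds.
static double timeCopies(char *devIn, const char *hostIn,
                         char *hostOut, const char *devOut,
                         size_t bytes, cudaStream_t up, cudaStream_t down)
{
    CHECK(cudaDeviceSynchronize());  // make sure the GPU is idle first
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpyAsync(devIn, hostIn, bytes, cudaMemcpyHostToDevice, up));
    CHECK(cudaMemcpyAsync(hostOut, devOut, bytes, cudaMemcpyDeviceToHost, down));
    CHECK(cudaDeviceSynchronize());  // wait for both copies to finish
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const size_t bytes = (size_t)256 << 20;  // 256 MiB per direction (arbitrary)

    // Pinned (page-locked) host buffers, needed for truly asynchronous copies.
    char *hostIn = NULL, *hostOut = NULL;
    CHECK(cudaMallocHost(&hostIn, bytes));
    CHECK(cudaMallocHost(&hostOut, bytes));
    memset(hostIn, 1, bytes);

    char *devIn = NULL, *devOut = NULL;
    CHECK(cudaMalloc(&devIn, bytes));
    CHECK(cudaMalloc(&devOut, bytes));
    CHECK(cudaMemset(devOut, 0, bytes));

    cudaStream_t s[2];
    CHECK(cudaStreamCreate(&s[0]));
    CHECK(cudaStreamCreate(&s[1]));

    // Warm-up run so one-time setup costs do not skew the first measurement.
    timeCopies(devIn, hostIn, hostOut, devOut, bytes, s[0], s[1]);

    // Same stream for both copies: they must run one after the other.
    double msSame = timeCopies(devIn, hostIn, hostOut, devOut, bytes, s[0], s[0]);
    // Different streams: the two directions are free to overlap.
    double msTwo = timeCopies(devIn, hostIn, hostOut, devOut, bytes, s[0], s[1]);

    printf("same stream: %.2f ms, two streams: %.2f ms\n", msSame, msTwo);

    CHECK(cudaStreamDestroy(s[0]));
    CHECK(cudaStreamDestroy(s[1]));
    CHECK(cudaFree(devIn));
    CHECK(cudaFree(devOut));
    CHECK(cudaFreeHost(hostIn));
    CHECK(cudaFreeHost(hostOut));
    return 0;
}
```

If both copy engines are in use, the two-stream time should come out noticeably lower than the same-stream time; how much lower depends on your system. A profiler timeline is the more reliable way to confirm the overlap.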

Also, the OS matters here. If you are running on Windows with the GPU in WDDM driver mode, that can get in the way of observing concurrency.
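
If you want to check both points from code rather than from deviceQuery, a small sketch like the following prints the relevant fields of cudaDeviceProp. It assumes the GTX 1070 is device 0, which may not be the case on your machine.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int dev = 0;  // assumption: the GPU of interest is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // asyncEngineCount: number of asynchronous copy engines
    //   (2 means copies in both directions can run at the same time).
    // tccDriver: 1 if the device uses the TCC driver, 0 otherwise
    //   (on Windows, 0 means the device is in WDDM mode).
    printf("%s: asyncEngineCount=%d, tccDriver=%d\n",
           prop.name, prop.asyncEngineCount, prop.tccDriver);
    return 0;
}
```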