Are all memcpy operations issued in order on a GeForce GTX card, regardless of stream and compute capability?

I’m asking this because I’m really tired of an issue I have on a GTX Titan card.

No matter what I do, all memory operations seem to be executed in order. I thought the GTX Titan implements Hyper-Q for CUDA streams, and that this feature should allow memory operations to be issued in any order, but it seems to work only for compute (kernel executions). Does that make sense?

I really believe all memcpys are issued in order, and because there is only one copy engine, achieving overlap between kernel executions and memcpys is really hard.

For example, a typical scenario:
HD1 → K1 → DH1
HD2 → K2 → DH2

If I issue HD1 → K1 → DH1 → HD2 → K2 → DH2, on my GTX Titan (WDDM driver) HD2 waits until DH1 has finished. The only way I have found to achieve concurrency is to issue HD1 → K1 → HD2 → K2 and use stream callbacks to issue the DH copies after the kernels finish; see the sketch just below.
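A minimal sketch of that callback workaround (dummyKernel, markDone, and kernelDone are names I made up for this sketch; they are not part of the test code further down). Note that a stream callback may not call CUDA API functions itself, so it only sets a flag; the host thread polls the flags and issues the actual DH copies:

#include "cuda_runtime.h"

#define NBUF 4
#define NELEM (1024*1024)

__global__ void dummyKernel(int *d_in, int *d_out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = d_in[i];
}

volatile int kernelDone[NBUF]; //0 = pending, 1 = kernel finished, 2 = DH copy issued

//callbacks must not make CUDA calls, so this only records completion
//(volatile polling is not strictly portable synchronization, but it is enough for a sketch)
void CUDART_CB markDone(cudaStream_t stream, cudaError_t status, void *userData) {
    kernelDone[(size_t)userData] = 1;
}

int main() {
    int *h_in[NBUF], *h_out[NBUF], *d_in[NBUF], *d_out[NBUF];
    cudaStream_t st[2];
    cudaStreamCreate(&st[0]);
    cudaStreamCreate(&st[1]);
    for (int i = 0; i < NBUF; i++) {
        cudaMallocHost(&h_in[i], NELEM*sizeof(int));
        cudaMallocHost(&h_out[i], NELEM*sizeof(int));
        cudaMalloc(&d_in[i], NELEM*sizeof(int));
        cudaMalloc(&d_out[i], NELEM*sizeof(int));
    }

    //issue only the HD copies and kernels, so no DH copy can block a later HD copy
    for (int i = 0; i < NBUF; i++) {
        cudaMemcpyAsync(d_in[i], h_in[i], NELEM*sizeof(int), cudaMemcpyHostToDevice, st[i%2]);
        dummyKernel<<<NELEM/256, 256, 0, st[i%2]>>>(d_in[i], d_out[i]);
        cudaStreamAddCallback(st[i%2], markDone, (void*)(size_t)i, 0);
    }

    //poll: as soon as kernel i has finished, issue its DH copy
    for (int issued = 0; issued < NBUF; ) {
        for (int i = 0; i < NBUF; i++) {
            if (kernelDone[i] == 1) {
                kernelDone[i] = 2;
                cudaMemcpyAsync(h_out[i], d_out[i], NELEM*sizeof(int), cudaMemcpyDeviceToHost, st[i%2]);
                issued++;
            }
        }
    }
    cudaDeviceSynchronize();
    return 0;
}

A condition variable would avoid the busy-wait, but polling keeps the sketch short.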

Could anyone test this code and tell me if it works for you, please?
I'm interested in results on GeForce cards.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"


#include <stdio.h>
#define N (1024*1024)

//dummy kernel: each thread copies 1024 consecutive elements starting at its
//own index (the ranges overlap; the kernel only exists to create some work)
__global__ void someKernel(int *d_in, int *d_out) {
    for (int i = threadIdx.x; i < threadIdx.x + 1024; i++) {
        d_out[i] = d_in[i];
    }
}

int main () {
    int *h_bufferIn[100];
    int *h_bufferOut[100];
    int *d_bufferIn[100];
    int *d_bufferOut[100];

    //allocate pinned host memory (required for async copies) and device memory
    for (int i = 0; i < 100; i++) {
        cudaMallocHost(&h_bufferIn[i],N*sizeof(int));
        cudaMallocHost(&h_bufferOut[i],N*sizeof(int));
        cudaMalloc(&d_bufferIn[i], N*sizeof(int));
        cudaMalloc(&d_bufferOut[i], N*sizeof(int));
    }

    //create cuda streams
    cudaStream_t st[2];
    cudaStreamCreate(&st[0]);
    cudaStreamCreate(&st[1]);

    //trying to overlap computation and memcpys
    //i is always even here, so buffer i goes to stream 0 and buffer i+1 to stream 1
    for (int i = 0; i < 100; i += 2) {
        cudaMemcpyAsync(d_bufferIn[i], h_bufferIn[i], N*sizeof(int), cudaMemcpyHostToDevice, st[i%2]);
        someKernel<<<1,256, 0, st[i%2]>>>(d_bufferIn[i], d_bufferOut[i]);
        cudaMemcpyAsync(h_bufferOut[i], d_bufferOut[i], N*sizeof(int), cudaMemcpyDeviceToHost, st[i%2]);

        cudaMemcpyAsync(d_bufferIn[i+1], h_bufferIn[i+1], N*sizeof(int), cudaMemcpyHostToDevice, st[(i+1)%2]);
        someKernel<<<1,256, 0, st[(i+1)%2]>>>(d_bufferIn[i+1], d_bufferOut[i+1]);
        cudaMemcpyAsync(h_bufferOut[i+1], d_bufferOut[i+1], N*sizeof(int), cudaMemcpyDeviceToHost, st[(i+1)%2]);
    }
    cudaDeviceSynchronize();

    //clean up
    for (int i = 0; i < 100; i++) {
        cudaFreeHost(h_bufferIn[i]);
        cudaFreeHost(h_bufferOut[i]);
        cudaFree(d_bufferIn[i]);
        cudaFree(d_bufferOut[i]);
    }
    cudaStreamDestroy(st[0]);
    cudaStreamDestroy(st[1]);
    return 0;
}
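
To check for overlap, you can inspect the per-operation timestamps with nvprof; assuming the file is saved as overlap.cu (the file name and -arch value are just examples; sm_35 matches the Titan, adjust for your card):

nvcc -O2 -arch=sm_35 overlap.cu -o overlap
nvprof --print-gpu-trace ./overlap

If HD2 starts before DH1 has finished, the transfers are overlapping; on my GTX Titan they never do.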

Original question on Stack Overflow:
http://stackoverflow.com/questions/17564791/what-is-the-best-strategy-to-overlap-kernel-execution-and-data-transfers-in-a-gt