Concurrent data copying and kernel execution

I know that overlapping computation and memcpys is good for performance, but I’m curious about accessing data that is copied in concurrently after a kernel invocation. My own experiments have shown that accessing data copied onto the device by cudaMemcpyAsync after an asynchronous kernel call (each in a different stream) is not possible. Has anyone tried this, or tried it successfully? Does anyone know whether it is definitely impossible?
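For context, here is a minimal sketch of the setup being described (the names `kernel`, `d_data`, and `h_data` are placeholders, not from the original post): a kernel launched in one stream while cudaMemcpyAsync writes into device memory from another stream.

```
// Hypothetical sketch of the two-stream setup in question.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Launch the kernel asynchronously in stream 0.
kernel<<<num_blocks, num_threads, 0, s0>>>(d_data);

// While the kernel runs, copy new data to the device in stream 1.
// Note: h_data must be pinned (cudaHostAlloc/cudaMallocHost) for the
// copy to actually be asynchronous and overlap with execution.
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, s1);
```

The question is whether the running kernel can safely observe the data written by that concurrent copy.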

Thanks for any help!

It’s not possible.

Can you elaborate on exactly why?

Race conditions, can’t guarantee that the copy will actually be concurrent, things like that.

Wouldn’t just doing a cudaStreamSynchronize ensure that the cudaMemcpyAsync completes? And there’s no data race if the only thing writing to that memory address is the host (i.e., the device is waiting on a flag that the host sets via that cudaMemcpyAsync).
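A sketch of that host-set-flag idea might look like the following (the names `s0`, `s1`, `d_flag`, and `h_flag` are illustrative assumptions, not from the post):

```
// Hypothetical sketch: kernel spins on a device flag that the host
// sets from another stream, then synchronizes on the copy.
int *h_flag;
cudaMallocHost(&h_flag, sizeof(int));  // pinned, so the copy is truly async
*h_flag = 1;

kernel<<<num_blocks, num_threads, 0, s0>>>(d_flag /*, ... */);

cudaMemcpyAsync(d_flag, h_flag, sizeof(int),
                cudaMemcpyHostToDevice, s1);
cudaStreamSynchronize(s1);  // returns only after the copy has completed
```

One caveat: cudaStreamSynchronize only tells the host that the copy finished; it says nothing about when (or whether) the spinning kernel observes the new value, and a kernel that spins forever can deadlock the device if not all of its blocks are resident.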

One option I had considered was using a kernel to set the data up in memory and then call a __threadfence() to ensure all other threads saw the changes, i.e.:

__global__ void kernel(volatile int *flag, ...) {
    ...
    // Spin until the flag is set. The pointer is volatile so the
    // read isn't cached in a register, and the comparison uses ==
    // rather than assignment.
    while (*flag == 0)
        ;
    ...
}

__global__ void set_kernel(int *flag) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *flag = 1;
        __threadfence();  // make the store visible to all threads on the device
    }
}

int main(int argc, char **argv) {
    ...
    kernel<<<num_blocks, num_threads>>>(d_flag, ...);
    ...
    set_kernel<<<1, 32>>>(d_flag);
    ...
}

but that doesn’t seem to accomplish what I want either. This would only work on Fermi, since it requires concurrent kernel execution, but according to the CUDA 3.1 Programming Guide __threadfence should make memory accesses visible to “all threads in the device for global memory accesses.” It’s probably an extremely inefficient solution, but I want to see if this can be done.