Concurrent data copying and kernel execution

I know that overlapping computation and memcpys is good for performance, but I’m curious about accessing data that is copied in concurrently after a kernel invocation. My own experiments have shown that accessing data copied onto the device by cudaMemcpyAsync after an asynchronous kernel call (each in a different stream) is not possible. Has anyone tried this, or tried it successfully? Does anyone know whether it is definitely impossible?
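For context, here is a minimal sketch of the setup being described (the names `kernel`, `d_data`, and `h_data` are placeholders, not from the original post): a kernel launched in one stream while cudaMemcpyAsync writes into device memory from another stream.

```
// Hypothetical sketch of the two-stream setup in question.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Launch the kernel asynchronously in stream 0.
kernel<<<num_blocks, num_threads, 0, s0>>>(d_data);

// While the kernel runs, copy new data to the device in stream 1.
// Note: h_data must be pinned (cudaHostAlloc/cudaMallocHost) for the
// copy to actually be asynchronous and overlap with execution.
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, s1);
```

The question is whether the running kernel can safely observe the data written by that concurrent copy.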

Thanks for any help!

It’s not possible.

Can you elaborate on exactly why?

Race conditions, can’t guarantee that the copy will actually be concurrent, things like that.

Wouldn’t just doing a cudaStreamSynchronize ensure that the cudaMemcpyAsync completes? And there’s no data race if the only thing writing to that memory address is the host (i.e., the device is waiting on a flag that the host sets via that cudaMemcpyAsync).
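A sketch of that host-set-flag idea might look like the following (the names `s0`, `s1`, `d_flag`, and `h_flag` are illustrative assumptions, not from the post):

```
// Hypothetical sketch: kernel spins on a device flag that the host
// sets from another stream, then synchronizes on the copy.
int *h_flag;
cudaMallocHost(&h_flag, sizeof(int));  // pinned, so the copy is truly async
*h_flag = 1;

kernel<<<num_blocks, num_threads, 0, s0>>>(d_flag /*, ... */);

cudaMemcpyAsync(d_flag, h_flag, sizeof(int),
                cudaMemcpyHostToDevice, s1);
cudaStreamSynchronize(s1);  // returns only after the copy has completed
```

One caveat: cudaStreamSynchronize only tells the host that the copy finished; it says nothing about when (or whether) the spinning kernel observes the new value, and a kernel that spins forever can deadlock the device if not all of its blocks are resident.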

One option I had considered was using a kernel to set the data up in memory and then call a __threadfence() to ensure all other threads saw the changes, i.e.:

__global__ void kernel(volatile int *flag, ...) {
    ...
    // Spin until the flag is set. The pointer is volatile so the
    // read isn't cached in a register, and the comparison uses ==
    // rather than assignment.
    while (*flag == 0)
        ;
    ...
}

__global__ void set_kernel(int *flag) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *flag = 1;
        __threadfence();  // make the store visible to all threads on the device
    }
}

int main(int argc, char **argv) {
    ...
    kernel<<<num_blocks, num_threads>>>(d_flag, ...);
    ...
    set_kernel<<<1, 32>>>(d_flag);
    ...
}

but that doesn’t seem to accomplish what I want either. This would only work on Fermi, since it requires concurrent kernel execution, but according to the CUDA 3.1 Programming Guide __threadfence should make memory accesses visible to “all threads in the device for global memory accesses.” It’s probably an extremely inefficient solution, but I want to see if this can be done.