I have a CUDA program that performs the following steps:
1. Load a chunk of data onto the GPU (cudaMemcpy - Host to Device).
2. Launch a kernel to work on that chunk.
3. Read back the output from the GPU (cudaMemcpy - Device to Host), then go to step 1 for the next chunk. (A minimal sketch of this loop follows.)
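For context, here is a minimal sketch of what the loop looks like. The buffer names, chunk size, and the processChunk kernel are placeholders, not my actual code:

```
#include <cuda_runtime.h>

#define CHUNK_SIZE (1 << 20)   // elements per chunk (placeholder value)

// Stand-in for the real per-chunk work
__global__ void processChunk(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}

void run(const float *h_in, float *h_out, int numChunks)
{
    float *d_in, *d_out;   // separate input and output areas on the GPU
    cudaMalloc(&d_in,  CHUNK_SIZE * sizeof(float));
    cudaMalloc(&d_out, CHUNK_SIZE * sizeof(float));

    for (int c = 0; c < numChunks; ++c) {
        // Step 1: copy the next chunk to the device
        cudaMemcpy(d_in, h_in + (size_t)c * CHUNK_SIZE,
                   CHUNK_SIZE * sizeof(float), cudaMemcpyHostToDevice);

        // Step 2: launch the kernel on that chunk
        processChunk<<<(CHUNK_SIZE + 255) / 256, 256>>>(d_in, d_out, CHUNK_SIZE);

        // Step 3: read the result back, then loop to the next chunk
        cudaMemcpy(h_out + (size_t)c * CHUNK_SIZE, d_out,
                   CHUNK_SIZE * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
}
```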
I'm working on a Quadro FX 5600 (compute capability 1.0), a device that doesn't support asynchronous concurrent execution of a kernel and a memcpy.
Here are my questions:
1. Can we overlap steps 3 and 1 in my program? That is, is it possible to overlap two cudaMemcpy calls (one host-to-device and the other device-to-host)? I am working with different chunks of memory on the host, and I have separate areas for input and output on the GPU.
2. Is the overlap from question 1 possible on my device (G80, compute capability 1.0)?
3. If yes, how do I do it? (My guess at what it might look like is sketched below.)
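For reference, here is what I imagine the overlapping version would look like, using two streams and cudaMemcpyAsync. I believe the host buffers would need to be pinned (allocated with cudaMallocHost) for the copies to be truly asynchronous. The function and stream names are my own placeholders, and I don't know whether the two copies actually run concurrently on a compute 1.0 part; that's exactly what I'm asking:

```
#include <cuda_runtime.h>

#define CHUNK_SIZE (1 << 20)   // same placeholder chunk size as above

// Attempted overlap: read back chunk c's output while uploading
// chunk c+1's input. h_in and h_out must be pinned host memory
// (cudaMallocHost) for cudaMemcpyAsync to be truly asynchronous.
void swapChunks(float *h_in, float *h_out, float *d_in, float *d_out, int c)
{
    cudaStream_t down, up;
    cudaStreamCreate(&down);
    cudaStreamCreate(&up);

    // Device-to-host copy of the finished chunk (step 3)...
    cudaMemcpyAsync(h_out + (size_t)c * CHUNK_SIZE, d_out,
                    CHUNK_SIZE * sizeof(float), cudaMemcpyDeviceToHost, down);
    // ...issued alongside the host-to-device copy of the next chunk (step 1)
    cudaMemcpyAsync(d_in, h_in + (size_t)(c + 1) * CHUNK_SIZE,
                    CHUNK_SIZE * sizeof(float), cudaMemcpyHostToDevice, up);

    // Wait for both transfers before launching the next kernel
    cudaStreamSynchronize(down);
    cudaStreamSynchronize(up);

    cudaStreamDestroy(down);
    cudaStreamDestroy(up);
}
```

Does issuing the copies in separate streams like this actually buy anything on my hardware, or do they get serialized anyway?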
Looking for any help in this regard…