I have a CUDA FORTRAN code which has the following structure:
host code
arrays copy from host to device
do i = 1,n
   device code
   arrays copy from device to host (with cudaMemcpyAsync)
end do
other host code
In particular, I need to copy an array called t (which contains the time value) computed on the device back to the host at every time step. The cudaMemcpyAsync call I use is the following:
ierr = cudaMemcpyAsync(t,t_dev,1,stream_copy)
- t and t_dev are 1-element arrays on the host and on the device, respectively. The host array is declared with the "pinned" attribute;
- stream_copy is a CUDA stream handle, declared and created as follows:

integer(kind=cuda_stream_kind) :: stream_copy
ierr = cudaStreamCreate(stream_copy)
The device kernels are launched on a stream called "stream_work", created in the same way.
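Putting the pieces together, here is a minimal sketch of the structure I described (the kernel name, launch configuration, and array sizes are placeholders, not my actual code):

```fortran
program async_copy_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 1000
  real, pinned, allocatable :: t(:)        ! 1-element pinned host array
  real, device :: t_dev(1)                 ! 1-element device array
  integer(kind=cuda_stream_kind) :: stream_copy, stream_work
  integer :: ierr, i

  allocate(t(1))
  ierr = cudaStreamCreate(stream_copy)
  ierr = cudaStreamCreate(stream_work)

  ! ... host code, host-to-device copies of the work arrays ...

  do i = 1, n
     ! device code runs on stream_work (placeholder launch):
     ! call my_kernel<<<grid, tBlock, 0, stream_work>>>(t_dev, ...)

     ! asynchronous device-to-host copy of the current time value
     ierr = cudaMemcpyAsync(t, t_dev, 1, stream_copy)

     ! t(1) is read on the host here, possibly before the copy has completed
  end do

  ! ... other host code ...
end program async_copy_sketch
```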
What I get is a long sequence of time values that are identical in chunks of 8-10, since the copy operation takes longer than the computation of the data. Is there any way to improve this? The output I obtain right now is useless, and if I instead copy the time value at each step with a synchronous copy, execution becomes really slow (much slower than the CPU code).
Thanks in advance for your help,