I am trying to implement asynchronous data transfer between the GPU and CPU in CUDA Fortran. My data is a 3D array, which I assume means I should use cudaMemcpy3DAsync. But the CUDA Fortran reference is too brief, and I do not know how to fill in the “cudaMemcpy3DParms” structure. Does anybody have experience performing an asynchronous transfer of a 3D array?
By the way, if I want to copy a 4D array to the GPU asynchronously, must I split it into many 3D arrays, or are there alternative methods?
I don’t have an example offhand but could pull one together. That said, it’s probably not necessary to use the 3D functions. If you are copying the entire array, you can simply use cudaMemcpyAsync. Fortran arrays are contiguous, so just copy it as a 1-D array with a size of N*M*L. The same can be done with a 4-D array.
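For illustration, a whole-array copy along those lines might look like the minimal sketch below. The program name, the array names `a` and `a_d`, and the sizes `n`, `m`, `l` are all hypothetical, and it assumes the host array is allocated with the `pinned` attribute, since a copy from pageable host memory generally will not run asynchronously. The count passed to cudaMemcpyAsync is in elements, not bytes.

```fortran
program copy_whole_array
  use cudafor
  implicit none
  integer, parameter :: n = 32, m = 32, l = 32     ! hypothetical sizes
  real, pinned, allocatable :: a(:,:,:)            ! pinned host array (assumption)
  real, device, allocatable :: a_d(:,:,:)          ! device array of the same shape
  integer(kind=cuda_stream_kind) :: stream
  integer :: istat

  allocate(a(n,m,l), a_d(n,m,l))
  a = 1.0

  istat = cudaStreamCreate(stream)

  ! The 3-D array is one contiguous block of n*m*l elements,
  ! so a single 1-D style copy moves all of it.
  istat = cudaMemcpyAsync(a_d, a, n*m*l, stream)

  ! Wait for the copy to finish before a_d is used or a is modified.
  istat = cudaStreamSynchronize(stream)

  istat = cudaStreamDestroy(stream)
  deallocate(a, a_d)
end program copy_whole_array
```

A 4-D array would be handled the same way, with the element count being the product of all four extents.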
Actually, I am trying to implement a program that overlaps asynchronous data transfer with kernel execution. I think I have to divide my data (which is a 4D array) into several parts, so that each part can be transferred in a different stream and the kernel in the same stream can then be executed on it. Can I also use cudaMemcpyAsync to do this?
I just cannot find any examples of how to use cudaMemcpy3DAsync. I need to copy part of a 3D array (say A(N1,N2,N3)) each time in order to overlap communication and computation. For example, I need to copy A(N1/2,N2/2,N3/2), then launch a kernel, then copy another part of A and execute the kernel again.
Can I use cudaMemcpyAsync to do this, or how would I do it using cudaMemcpy3DAsync?
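One way the chunked scheme described above could be sketched with plain cudaMemcpyAsync is to split only along the last dimension. Because Fortran arrays are column-major, a slab A(:,:,k0:k1) is a single contiguous run of N1*N2*(k1-k0+1) elements starting at A(1,1,k0). A sub-block such as A(1:N1/2,1:N2/2,1:N3/2) is not contiguous and cannot be moved with one 1-D copy. Everything named here (`scale_planes`, `nstreams`, `slab`, the sizes) is hypothetical, the kernel is only a stand-in, and the sketch assumes pinned host memory and an extent N3 divisible by the number of streams.

```fortran
module kernels_m
  use cudafor
contains
  ! Stand-in kernel: doubles every element in planes k0..k1.
  attributes(global) subroutine scale_planes(a, n1, n2, n3, k0, k1)
    integer, value :: n1, n2, n3, k0, k1
    real :: a(n1, n2, n3)
    integer :: i, j, k
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    j = (blockIdx%y - 1)*blockDim%y + threadIdx%y
    if (i <= n1 .and. j <= n2) then
       do k = k0, k1
          a(i,j,k) = 2.0*a(i,j,k)
       end do
    end if
  end subroutine scale_planes
end module kernels_m

program overlap_slabs
  use cudafor
  use kernels_m
  implicit none
  integer, parameter :: n1 = 64, n2 = 64, n3 = 64  ! hypothetical sizes
  integer, parameter :: nstreams = 4               ! n3 assumed divisible by nstreams
  integer, parameter :: slab = n3/nstreams         ! planes per chunk
  real, pinned, allocatable :: a(:,:,:)            ! pinned host array (assumption)
  real, device, allocatable :: a_d(:,:,:)
  integer(kind=cuda_stream_kind) :: stream(nstreams)
  type(dim3) :: grid, tblock
  integer :: istat, i, k0, k1

  allocate(a(n1,n2,n3), a_d(n1,n2,n3))
  a = 1.0
  tblock = dim3(16, 16, 1)
  grid   = dim3((n1 + 15)/16, (n2 + 15)/16, 1)

  do i = 1, nstreams
     istat = cudaStreamCreate(stream(i))
  end do

  do i = 1, nstreams
     k0 = (i - 1)*slab + 1
     k1 = k0 + slab - 1
     ! Host-to-device copy of the contiguous slab a(:,:,k0:k1).
     istat = cudaMemcpyAsync(a_d(1,1,k0), a(1,1,k0), n1*n2*slab, stream(i))
     ! Kernel in the same stream runs after that slab has arrived.
     call scale_planes<<<grid, tblock, 0, stream(i)>>>(a_d, n1, n2, n3, k0, k1)
     ! Device-to-host copy of the result, again in the same stream.
     istat = cudaMemcpyAsync(a(1,1,k0), a_d(1,1,k0), n1*n2*slab, stream(i))
  end do

  istat = cudaDeviceSynchronize()
  print *, 'all elements doubled: ', all(a == 2.0)

  do i = 1, nstreams
     istat = cudaStreamDestroy(stream(i))
  end do
  deallocate(a, a_d)
end program overlap_slabs
```

The same idea carries over to a 4-D array by chunking along the fourth dimension, with each chunk covering N1*N2*N3 elements per index of that dimension. Whether the copies and kernels actually overlap depends on the device and on the host memory being pinned.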