Asynchronous Memory Copy in CUDA Fortran


I was wondering if anyone has experience with, or examples of, using asynchronous memcpy with CUDA Fortran? At the moment, my program is structured like this:

compute Aerosol Arrays
copy All Device Arrays including Aerosol Arrays to device
copy Constant Data to device
execute Kernel

The issue is that the compute Aerosol Arrays step can take quite a long time, so I figure I should try to overlap as much of the memory copy with that step as I can. In truth, a good chunk of the data copied to the device is those Aerosol Arrays, but every little bit helps (plus I can learn for the future).

From what I can glean from the CUDA Fortran guides, I assume I’ll have to use the API calls since I don’t think the implicit memory copies are asynchronous. Is this correct?

If so, that’s why I thought I’d ask for examples while I stumble through the cudaStreamCreate, cudaMemcpyToSymbolAsync, etc.


Hi Matt,

Although I haven’t done it myself, you should be able to use the CUDA API calls to accomplish this. I don’t have an example, though (sorry).

We’re currently working on expanding the CUDA Fortran language to define this asynchronous behavior. Unfortunately, it doesn’t fit well into the current Fortran syntax, so we’ll most likely need to add an extension.

  • Mat
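
For readers following the thread, here is a minimal sketch of what the API-based approach discussed above might look like. It is not taken from either poster's code. It assumes the cudafor module, pinned host arrays (the host buffer generally has to be page-locked for the copy to overlap with host work), and the CUDA Fortran convention that the count is given in elements rather than bytes. The array names and the compute_aerosol routine are hypothetical stand-ins for the real program, and the kernel launch is only indicated in a comment.

```fortran
program async_copy_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024
  ! Pinned (page-locked) host arrays; hypothetical stand-ins for the real data
  real, pinned, allocatable :: other(:), aerosol(:)
  real, device, allocatable :: other_d(:), aerosol_d(:)
  real, allocatable :: check(:)
  integer(kind=cuda_stream_kind) :: stream
  integer :: istat

  allocate(other(n), aerosol(n), check(n))
  allocate(other_d(n), aerosol_d(n))
  istat = cudaStreamCreate(stream)

  ! Data that is already available: queue its copy, which returns right away
  other = 1.0
  istat = cudaMemcpyAsync(other_d, other, n, stream)

  ! Long host-side step runs while the copy above is in flight
  call compute_aerosol(aerosol)

  ! Queue the aerosol copy in the same stream once the host data is ready
  istat = cudaMemcpyAsync(aerosol_d, aerosol, n, stream)

  ! A kernel launched in the same stream would run after both copies, e.g.
  !   call my_kernel<<<grid, block, 0, stream>>>(...)   (hypothetical kernel)

  ! Wait for everything queued in the stream before using the results
  istat = cudaStreamSynchronize(stream)

  ! Verify with an ordinary (synchronous) device-to-host assignment
  check = aerosol_d
  print *, 'copy matches host data: ', all(check == aerosol)

  istat = cudaStreamDestroy(stream)
  deallocate(other, aerosol, check, other_d, aerosol_d)

contains

  ! Hypothetical placeholder for the expensive host computation
  subroutine compute_aerosol(a)
    real, intent(out) :: a(:)
    integer :: i
    do i = 1, size(a)
      a(i) = sqrt(real(i))
    end do
  end subroutine compute_aerosol

end program async_copy_sketch
```

Only the arrays that do not depend on the aerosol computation can be overlapped with it this way; the aerosol arrays themselves still have to wait until that step finishes.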

Hmm. Okay. Do you have any examples showing the allocation/copy process using the API calls?

I ask mainly about 2D and larger arrays. I figure cudaMalloc and cudaMemcpy are pretty simple, since a 1D array is 1D whether it’s Fortran or C. But once one gets into the 2D realm, I’m wondering whether you have to use cudaMallocPitch/cudaMemcpy2D (since Fortran arrays usually don’t act like C “arrays”)?

ETA: Never mind, I figured this out (essentially it does what a padded array version of a program I wrote does). I’m next going to start a new topic on 3D arrays since that’s all new to me.
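
To illustrate the padded-array idea mentioned above, here is a small host-only sketch in standard Fortran. It makes no CUDA calls. It only shows the index arithmetic: the leading dimension (the one that is contiguous in Fortran's column-major layout) is rounded up to an alignment boundary, and the data is copied column by column into the padded storage. The alignment of 4 elements is an arbitrary assumption for the example; with a real pitched allocation the runtime chooses the padding.

```fortran
program padded_array_sketch
  implicit none
  integer, parameter :: nrows = 5, ncols = 3
  integer, parameter :: align = 4      ! hypothetical alignment, in elements
  integer :: pitch, i, j
  real :: a(nrows, ncols)
  real, allocatable :: padded(:,:)

  ! Round the leading dimension up to the next multiple of align
  pitch = ((nrows + align - 1) / align) * align
  allocate(padded(pitch, ncols))
  padded = 0.0

  ! Fill the unpadded source array
  do j = 1, ncols
    do i = 1, nrows
      a(i, j) = real(i + 10*j)
    end do
  end do

  ! Copy one column at a time: each column holds nrows useful elements,
  ! but consecutive columns start pitch elements apart in the padded array
  do j = 1, ncols
    padded(1:nrows, j) = a(:, j)
  end do

  print *, 'padded leading dimension: ', pitch
  print *, 'data preserved: ', all(padded(1:nrows, :) == a)
end program padded_array_sketch
```

The column-by-column loop is roughly the bookkeeping that a pitched 2D copy performs, with different source and destination leading dimensions.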