Memcpy_async in CUDA Fortran

Hello,

I was wondering if the memcpy_async routine was available for use in CUDA Fortran now. In response to this posted TPR #36715 was created to possibly add this feature to CUDA Fortran.

I do not see any mention of memcpy_async in the cooperative_groups docs section so it seems that it probably is not available. However, in the post it was mentioned “there might be another method that I’m not aware of but the person that would know is on vacation”. So if it is not available I was curious if there was another method.

Best,
Josh

Hi Josh,

It doesn’t look like Brent has done anything here (or at least hasn’t updated the report). He’s out on vacation this week, but I added a note requesting status.

-Mat

There have been lots of flavors of new loads and stores, and we are a little bit behind. Here is some info for you though.

For many releases, we have supported “pipelined” async loads. This is one variation supported by CUDA C. It looks like we didn’t document this unfortunately, but there is an example in the 2nd edition of “CUDA Fortran for Scientists and Engineers”. See section 5.3.2.1.

The interfaces for these are in the WMMA module. There are just a few, and if you need more, let us know. Here is the code for the interfaces we have in there:

! PIPELINE Interfaces
interface pipelineMemcpyAsync
attributes(device) subroutine pipelineMemcpyAsyncR8x2(dst, src) &
bind(C, name=“__nvf_wmma_memcpy_async_r16”)
real(8), device :: dst(2), src(2)
end subroutine

attributes(device) subroutine pipelineMemcpyAsyncR8(dst, src) &
bind(C, name=“__nvf_wmma_memcpy_async_r8”)
real(8), device :: dst, src
end subroutine

attributes(device) subroutine pipelineMemcpyAsyncR4x4(dst, src) &
bind(C, name=“__nvf_wmma_memcpy_async_r16”)
real(4), device :: dst(4), src(4)
end subroutine

attributes(device) subroutine pipelineMemcpyAsyncR4(dst, src) &
bind(C, name=“__nvf_wmma_memcpy_async_r4”)
real(4), device :: dst, src
end subroutine
end interface

interface pipelineCommit
attributes(device) subroutine pipelineCommit() &
bind(C, name=“__nvf_wmma_pipeline_commit”)
end subroutine pipelineCommit
end interface pipelineCommit

interface pipelineWaitPrior
attributes(device) subroutine pipelineWaitPrior(prior) &
bind(C, name=“__nvf_wmma_pipeline_wait_prior”)
integer(8), value :: prior
end subroutine pipelineWaitPrior
end interface pipelineWaitPrior


If you need different data types, you can probably just call the C name for a “size” that matches, and that should work. The assembly for these functions gets inlined at compile time, so they are efficient.

Recently we added support for TMA operations too. That WAS documented. See section 3.6.7 of the current CUDA Fortran Programming Guide.

Mat and I will prioritize getting the updated WMMA documentation into our next release.

Thank you @bleback! Sorry for the delayed reponse (I was on vacation).
The pipelineMemcpyAsync routine was exactly what I was looking for. Section 5.3.2.1 of CUDA Fortran for Scientists and Engineers provides a great explanation of the routines too. Thank you for pointing me to that reference.

This is now documented in the CUDA Fortran Programming Guide: NVIDIA CUDA Fortran Programming Guide — NVIDIA CUDA Fortran Programming Guide 25.7 documentation

Thank you for getting this in the documentation and following up!