Memcpy_async in CUDA Fortran

joanib14 · June 18, 2025, 2:19am

Hello,

I was wondering if the memcpy_async routine was available for use in CUDA Fortran now. In response to this posted TPR #36715 was created to possibly add this feature to CUDA Fortran.

I do not see any mention of memcpy_async in the cooperative_groups docs section so it seems that it probably is not available. However, in the post it was mentioned “there might be another method that I’m not aware of but the person that would know is on vacation”. So if it is not available I was curious if there was another method.

Best,
Josh

MatColgrove · June 18, 2025, 3:41pm

Hi Josh,

It doesn’t look like Brent has done anything here (or at least hasn’t updated the report). He’s out on vacation this week, but I added a note requesting status.

-Mat

bleback · June 25, 2025, 12:40am

There have been lots of flavors of new loads and stores, and we are a little bit behind. Here is some info for you though.

For many releases, we have supported “pipelined” async loads. This is one variation supported by CUDA C. It looks like we didn’t document this unfortunately, but there is an example in the 2nd edition of “CUDA Fortran for Scientists and Engineers”. See section 5.3.2.1.

The interfaces for these are in the WMMA module. There are just a few, and if you need more, let us know. Here is the code for the interfaces we have in there:

! PIPELINE Interfaces
interface pipelineMemcpyAsync
attributes(device) subroutine pipelineMemcpyAsyncR8x2(dst, src) &
bind(C, name=“__nvf_wmma_memcpy_async_r16”)
real(8), device :: dst(2), src(2)
end subroutine

attributes(device) subroutine pipelineMemcpyAsyncR8(dst, src) &
bind(C, name=“__nvf_wmma_memcpy_async_r8”)
real(8), device :: dst, src
end subroutine

attributes(device) subroutine pipelineMemcpyAsyncR4x4(dst, src) &
bind(C, name=“__nvf_wmma_memcpy_async_r16”)
real(4), device :: dst(4), src(4)
end subroutine

attributes(device) subroutine pipelineMemcpyAsyncR4(dst, src) &
bind(C, name=“__nvf_wmma_memcpy_async_r4”)
real(4), device :: dst, src
end subroutine
end interface

interface pipelineCommit
attributes(device) subroutine pipelineCommit() &
bind(C, name=“__nvf_wmma_pipeline_commit”)
end subroutine pipelineCommit
end interface pipelineCommit

interface pipelineWaitPrior
attributes(device) subroutine pipelineWaitPrior(prior) &
bind(C, name=“__nvf_wmma_pipeline_wait_prior”)
integer(8), value :: prior
end subroutine pipelineWaitPrior
end interface pipelineWaitPrior

If you need different data types, you can probably just call the C name for a “size” that matches, and that should work. The assembly for these functions gets inlined at compile time, so they are efficient.

Recently we added support for TMA operations too. That WAS documented. See section 3.6.7 of the current CUDA Fortran Programming Guide.

Mat and I will prioritize getting the updated WMMA documentation into our next release.

joanib14 · July 10, 2025, 5:26pm

Thank you @bleback! Sorry for the delayed reponse (I was on vacation).
The pipelineMemcpyAsync routine was exactly what I was looking for. Section 5.3.2.1 of CUDA Fortran for Scientists and Engineers provides a great explanation of the routines too. Thank you for pointing me to that reference.

bleback · July 24, 2025, 12:19am

This is now documented in the CUDA Fortran Programming Guide: NVIDIA CUDA Fortran Programming Guide — NVIDIA CUDA Fortran Programming Guide 25.7 documentation

joanib14 · July 25, 2025, 1:45am

Thank you for getting this in the documentation and following up!

Topic		Replies	Views
Asynchronous Memory Copy in CUDA Fortran Legacy PGI Compilers	2	6238	June 4, 2010
Using cuda pipeline with TMA CUDA Programming and Performance	0	398	May 29, 2024
__pipeline_memcpy_async VS cuda::memcpy_async CUDA Programming and Performance cuda	0	86	November 26, 2024
Questions about "cudaMemcpyAsync" Legacy PGI Compilers	1	2368	November 18, 2011
Performance of memcpyasync CUDA Programming and Performance	2	1139	June 17, 2021
Async transfers and CUFFT CUDA Programming and Performance	0	2078	December 18, 2007
cudaMemsetAsync and easier syntax for async copying Legacy PGI Compilers	2	3535	December 21, 2012
cudaMemcpyAsync problem CUDA Programming and Performance	9	3167	May 26, 2020
CPU blocked MUCH longer than expected calling a cudaMemcpy after a cuda graph launch CUDA Programming and Performance	7	588	October 19, 2023
cudaMemcpyAsync CUDA Programming and Performance	1	4860	December 8, 2008

Memcpy_async in CUDA Fortran

Related topics