`cuda::memcpy_async` and shared memory limits in CUDA Fortran

Is there any way to use `cuda::memcpy_async` or to increase the shared memory limit in CUDA Fortran? I don't see anything about either in the CUDA Fortran Programming Guide, and the suggested method from the CUDA C++ Programming Guide doesn't work.

I’m using nvfortran currently but can switch compilers if necessary.

Sure. To perform asynchronous data transfers, you'd use `cudaMemcpyAsync`, as shown in this example.
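Something along these lines, for instance (a minimal sketch; the array names, sizes, and the pinned host allocation are illustrative, not taken from the linked example):

```fortran
program async_copy_demo
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024
  real, allocatable, pinned :: a_h(:)   ! page-locked host memory so the copy can overlap
  real, allocatable, device :: a_d(:)
  integer(kind=cuda_stream_kind) :: stream
  integer :: istat

  allocate(a_h(n))
  allocate(a_d(n))
  a_h = 1.0

  istat = cudaStreamCreate(stream)
  ! enqueue the host-to-device copy on the stream; the call returns immediately
  istat = cudaMemcpyAsync(a_d, a_h, n, cudaMemcpyHostToDevice, stream)
  ! ... launch kernels on the same stream here ...
  istat = cudaStreamSynchronize(stream)  ! block until the copy (and any kernels) finish

  istat = cudaStreamDestroy(stream)
  deallocate(a_h, a_d)
end program async_copy_demo
```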

For the shared memory, I believe you're looking for `cudaDeviceSetCacheConfig` to set the cache configuration to prefer shared memory.
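For instance, a minimal CUDA Fortran sketch of that call (the error handling is illustrative):

```fortran
program cache_config_demo
  use cudafor
  implicit none
  integer :: istat
  ! ask the runtime to favor shared memory over L1 in the configurable split;
  ! this is a hint, and has no effect on devices with a fixed cache configuration
  istat = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared)
  if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))
end program cache_config_demo
```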

I'm not talking about async host/device transfers, but about the asynchronous global-to-shared-memory copies using the `cuda::memcpy_async` API introduced in CUDA 11, discussed here.

I already tried `cudaDeviceSetCacheConfig`. Using it to raise the maximum shared memory isn't discussed in the CUDA Fortran guide, but it is in the CUDA C++ guide, so I followed that example and it didn't work. It compiled just fine; it simply didn't increase the maximum shared memory.

Apologies. Looking at the `cooperative_groups` module, I'm not seeing that interface implemented. I added an RFP (TPR #36715) to see if it's something we can add. There might be another method that I'm not aware of, but the person who would know is on vacation, so I can't ask.

The CUDA Fortran APIs are interfaces to the CUDA C API, with the CUDA Fortran docs just describing the interface. For detailed usage, the CUDA C docs are the correct place to look.

You might also try `cudaFuncSetAttribute` with the `cudaFuncAttributePreferredSharedMemoryCarveout` attribute. Note that reapportioning the cache to prefer more shared memory over L1 isn't supported on all systems.
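Assuming your cudafor version exposes this interface, a sketch might look like the following. The `stencil` kernel, the 100% carveout, and the 96 KB request are all placeholders here, and `cudaFuncAttributeMaxDynamicSharedMemorySize` is the companion attribute the CUDA C++ guide uses to raise the dynamic shared memory limit past the default 48 KB:

```fortran
module demo_kernels
contains
  attributes(global) subroutine stencil(a, n)
    real :: a(*)
    integer, value :: n
    real, shared :: tile(*)     ! dynamically sized shared memory
    ! ... kernel body elided ...
  end subroutine stencil
end module demo_kernels

program carveout_demo
  use cudafor
  use demo_kernels
  implicit none
  integer :: istat
  ! hint: devote the configurable carveout entirely (100%) to shared memory
  istat = cudaFuncSetAttribute(stencil, &
            cudaFuncAttributePreferredSharedMemoryCarveout, 100)
  ! opt in to more than the default 48 KB of dynamic shared memory
  ! (96 KB here; the device must actually support that much per block)
  istat = cudaFuncSetAttribute(stencil, &
            cudaFuncAttributeMaxDynamicSharedMemorySize, 96*1024)
  if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))
  ! the launch would then pass the dynamic shared size as the third config
  ! argument, e.g.: call stencil<<<grid, block, 96*1024>>>(a_d, n)
end program carveout_demo
```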
