`cuda::memcpy_async` and shared memory limits in CUDA Fortran

Is there any way to use `cuda::memcpy_async` or to increase the shared memory limit in CUDA Fortran? I don't see anything about either in the CUDA Fortran Programming Guide, and the suggested method from the CUDA C++ Programming Guide doesn't work.

I’m using nvfortran currently but can switch compilers if necessary.

Sure. To perform asynchronous data transfers, you'd use `cudaMemcpyAsync`, as shown in this example.
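Something along these lines, for instance (a minimal sketch; the array names, sizes, and the pinned host allocation are illustrative, not taken from the linked example):

```fortran
program async_copy_demo
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024
  real, allocatable, pinned :: a_h(:)   ! page-locked host memory so the copy can overlap
  real, allocatable, device :: a_d(:)
  integer(kind=cuda_stream_kind) :: stream
  integer :: istat

  allocate(a_h(n))
  allocate(a_d(n))
  a_h = 1.0

  istat = cudaStreamCreate(stream)
  ! enqueue the host-to-device copy on the stream; the call returns immediately
  istat = cudaMemcpyAsync(a_d, a_h, n, cudaMemcpyHostToDevice, stream)
  ! ... launch kernels on the same stream here ...
  istat = cudaStreamSynchronize(stream)  ! block until the copy (and any kernels) finish

  istat = cudaStreamDestroy(stream)
  deallocate(a_h, a_d)
end program async_copy_demo
```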

For the shared memory, I believe you're looking for `cudaDeviceSetCacheConfig` to set the cache configuration to prefer shared memory.
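For instance, a minimal CUDA Fortran sketch of that call (the error handling is illustrative):

```fortran
program cache_config_demo
  use cudafor
  implicit none
  integer :: istat
  ! ask the runtime to favor shared memory over L1 in the configurable split;
  ! this is a hint, and has no effect on devices with a fixed cache configuration
  istat = cudaDeviceSetCacheConfig(cudaFuncCachePreferShared)
  if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))
end program cache_config_demo
```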

I'm not talking about async host/device transfers, but about the asynchronous global-to-shared-memory copies using the `cuda::memcpy_async` API introduced in CUDA 11, discussed here.

I already tried `cudaDeviceSetCacheConfig`. Using it to raise the maximum shared memory isn't discussed in the CUDA Fortran guide, but it is in the CUDA C++ guide, so I followed that example and it didn't work. It compiled just fine; it simply didn't increase the maximum shared memory.

Apologies. Looking at the `cooperative_groups` module, I'm not seeing that interface implemented. I added an RFP (TPR #36715) to see if it's something we can add. There might be another method that I'm not aware of, but the person who would know is on vacation, so I can't ask.

The CUDA Fortran APIs are interfaces to the CUDA C API, with the CUDA Fortran docs just describing the interface. For detailed usage, the CUDA C docs are the correct place to look.

You might also try `cudaFuncSetAttribute` with the `cudaFuncAttributePreferredSharedMemoryCarveout` attribute. Note that reapportioning the cache to prefer more shared memory over L1 isn't supported on all systems.
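Assuming your cudafor version exposes this interface, a sketch might look like the following. The `stencil` kernel, the 100% carveout, and the 96 KB request are all placeholders here, and `cudaFuncAttributeMaxDynamicSharedMemorySize` is the companion attribute the CUDA C++ guide uses to raise the dynamic shared memory limit past the default 48 KB:

```fortran
module demo_kernels
contains
  attributes(global) subroutine stencil(a, n)
    real :: a(*)
    integer, value :: n
    real, shared :: tile(*)     ! dynamically sized shared memory
    ! ... kernel body elided ...
  end subroutine stencil
end module demo_kernels

program carveout_demo
  use cudafor
  use demo_kernels
  implicit none
  integer :: istat
  ! hint: devote the configurable carveout entirely (100%) to shared memory
  istat = cudaFuncSetAttribute(stencil, &
            cudaFuncAttributePreferredSharedMemoryCarveout, 100)
  ! opt in to more than the default 48 KB of dynamic shared memory
  ! (96 KB here; the device must actually support that much per block)
  istat = cudaFuncSetAttribute(stencil, &
            cudaFuncAttributeMaxDynamicSharedMemorySize, 96*1024)
  if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))
  ! the launch would then pass the dynamic shared size as the third config
  ! argument, e.g.: call stencil<<<grid, block, 96*1024>>>(a_d, n)
end program carveout_demo
```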
