Combining cudaMallocPitch() with Asynchronous Transfers in CUDA Fortran

Hi everyone,

I’m working with CUDA Fortran and wondering whether memory allocated with cudaMallocPitch() can be effectively combined with asynchronous transfers to improve performance.

  • cudaMallocPitch() allocates memory with each row padded to an aligned pitch, which makes memory access more efficient, especially for 2D arrays or matrices.
  • Asynchronous transfers (cudaMemcpyAsync) allow data transfers to overlap with kernel execution, reducing idle time.

Has anyone experimented with using both techniques together? Can aligning memory with cudaMallocPitch() improve the performance of asynchronous transfers?

I’d appreciate any insights or experiences!

Thanks!

I have not, but I believe you should be using cudaMemcpy2DAsync when working with pitched arrays.
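
To illustrate why a pitched allocation calls for a 2D copy rather than a flat one, here is a rough host-only sketch in plain C. It does not call the CUDA API at all: padded_pitch() and copy_2d() are hypothetical helpers that only emulate the layout. The argument order of copy_2d() mirrors the (dst, dpitch, src, spitch, width, height) order of the CUDA runtime's 2D copies, and the alignment value is an assumption, since the real pitch is chosen by the device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the pitch a pitched allocation might return:
   round the row width (in bytes) up to a multiple of `align`.
   The real value is device-dependent; `align` here is an assumption. */
size_t padded_pitch(size_t width_bytes, size_t align)
{
    assert(align > 0);
    return (width_bytes + align - 1) / align * align;
}

/* Host-only emulation of a 2D pitched copy: copy `height` rows of
   `width_bytes` each, stepping source and destination by their own
   pitch. The padding bytes at the end of each row are never touched. */
void copy_2d(void *dst, size_t dpitch,
             const void *src, size_t spitch,
             size_t width_bytes, size_t height)
{
    assert(dpitch >= width_bytes && spitch >= width_bytes);
    for (size_t row = 0; row < height; ++row) {
        memcpy((char *)dst + row * dpitch,
               (const char *)src + row * spitch,
               width_bytes);
    }
}
```

Because the source is tightly packed (spitch equals the row width) while the destination is padded, a single flat copy of width * height bytes would put every row after the first at the wrong offset. In Fortran's column-major layout the padded, contiguous dimension is the leading one, so "row" above corresponds to a column of the Fortran array.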

@mfatica might know more, but I would presume that a pitched 2D copy would be a bit slower than copying a 1D array.

-Mat