Hi everyone,
I’m working with CUDA Fortran and wondering whether memory allocated with cudaMallocPitch() can be effectively combined with asynchronous transfers to improve performance.
- cudaMallocPitch() allocates 2D arrays or matrices with each row padded so that every row starts at an aligned address, which helps keep memory accesses efficient.
- Asynchronous transfers (cudaMemcpyAsync) allow data transfers to overlap with kernel execution, reducing idle time.
Has anyone experimented with using both techniques together? Can aligning memory with cudaMallocPitch() improve the performance of asynchronous transfers?
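To make the question concrete, this is roughly the pattern I have in mind. It is only a sketch, and it is written in CUDA C++ rather than CUDA Fortran because I am more sure of the C signatures. The function name pitched_async_roundtrip, the scale kernel, and the two-stream split by rows are made up for illustration. It assumes pinned host memory (cudaMallocHost), since copies from pageable memory generally do not overlap with kernels, and it uses cudaMemcpy2DAsync rather than cudaMemcpyAsync because the padded device rows and the packed host rows have different strides.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scale every element of a pitched 2D array in place.
__global__ void scale(float *data, size_t pitch, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // pitch is in bytes, so rows are stepped through with char* arithmetic.
        float *rowPtr = (float *)((char *)data + (size_t)row * pitch);
        rowPtr[col] *= factor;
    }
}

// Hypothetical helper: returns 0 if the round trip gives the expected values.
int pitched_async_roundtrip(int width, int height)
{
    const int nStreams = 2;                 // illustrative choice
    cudaStream_t streams[nStreams];
    float *h = nullptr, *d = nullptr;
    size_t pitch = 0;
    size_t rowBytes = (size_t)width * sizeof(float);
    size_t count = (size_t)width * height;

    // Pinned host buffer (densely packed rows) and pitched device buffer.
    if (cudaMallocHost((void **)&h, rowBytes * height) != cudaSuccess) return 1;
    if (cudaMallocPitch((void **)&d, &pitch, rowBytes, height) != cudaSuccess) return 1;
    for (int s = 0; s < nStreams; ++s)
        if (cudaStreamCreate(&streams[s]) != cudaSuccess) return 1;

    for (size_t i = 0; i < count; ++i) h[i] = 1.0f;

    // Give each stream a contiguous block of rows: copy in, scale, copy out.
    int rowsPer = (height + nStreams - 1) / nStreams;
    dim3 block(32, 8);
    for (int s = 0; s < nStreams; ++s) {
        int row0 = s * rowsPer;
        int rows = (height - row0 < rowsPer) ? height - row0 : rowsPer;
        if (rows <= 0) continue;
        float *hChunk = h + (size_t)row0 * width;
        float *dChunk = (float *)((char *)d + (size_t)row0 * pitch);
        dim3 grid((width + block.x - 1) / block.x, (rows + block.y - 1) / block.y);

        // Host stride is rowBytes, device stride is pitch; both in bytes.
        cudaMemcpy2DAsync(dChunk, pitch, hChunk, rowBytes, rowBytes, rows,
                          cudaMemcpyHostToDevice, streams[s]);
        scale<<<grid, block, 0, streams[s]>>>(dChunk, pitch, width, rows, 2.0f);
        cudaMemcpy2DAsync(hChunk, rowBytes, dChunk, pitch, rowBytes, rows,
                          cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams, then check that every element was doubled.
    int bad = (cudaDeviceSynchronize() != cudaSuccess);
    for (size_t i = 0; i < count && !bad; ++i)
        if (h[i] != 2.0f) bad = 1;

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return bad;
}
```

Calling pitched_async_roundtrip(1000, 64) from a main function should return 0 on a machine with a working GPU. What I can't tell from this is whether the pitched layout itself makes the copies any faster, or whether it only helps the kernel's accesses.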
I’d appreciate any insights or experiences!
Thanks!