cudaMalloc and cudaMemcpy (cudaMemcpyAsync) are too time-consuming

Hi, my program takes 5 ms in total, but the actual GPU computation takes only 0.5 ms. The rest of the time is spent on memory allocation and copying. Do you have any suggestions?

Hi hailiangyang,

Instead of allocating and freeing memory for each computation, you can preallocate a chunk of memory that can be reused for multiple operations, reducing the overhead of memory allocation and eventual deallocation.
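As a rough sketch of the preallocation idea, the example below allocates one device buffer up front and reuses it on every iteration, so cudaMalloc and cudaFree each run once instead of once per computation. The `scale` kernel, the function name `run_with_preallocated_buffer`, and the buffer sizes are placeholders I made up for illustration, not anything from your program.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real computation.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs `iterations` rounds of upload, compute, download on ONE device buffer
// that is allocated before the loop and freed after it.
// Returns the first element of the last result, or -1.0f on a CUDA error.
float run_with_preallocated_buffer(int n, int iterations) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> host(n);

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;  // allocate once

    bool ok = true;
    for (int it = 0; it < iterations && ok; ++it) {
        std::fill(host.begin(), host.end(), 1.0f);  // fresh input each round
        ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
        scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
        ok = ok && cudaGetLastError() == cudaSuccess;
        // cudaMemcpy on the default stream waits for the kernel to finish.
        ok = ok && cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dev);  // free once, after all rounds
    return ok ? host[0] : -1.0f;
}

int main() {
    printf("first element: %f\n", run_with_preallocated_buffer(1 << 20, 100));
    return 0;
}
```

If the buffer size varies between computations, one option is to allocate for the largest size you expect and use only the part you need each time.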

Also, try to minimize unnecessary data transfers between the CPU and GPU. If possible, try to perform multiple computations on the GPU before transferring the results back to the CPU. This reduces the overhead of copying data between the CPU and GPU.
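To illustrate keeping intermediate results on the GPU, here is a minimal sketch in which two kernels run back to back on the same device buffer, with one copy in and one copy out rather than a round trip after each step. The kernels `add_one` and `square` and the function `run_chained` are hypothetical stand-ins for your own computation steps.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical first computation step.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical second step; it consumes the output of add_one in place.
__global__ void square(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

// Uploads the input once, runs both kernels on the device, downloads once.
// Returns the first result element, or -1.0f on a CUDA error.
float run_chained(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> host(n, 2.0f);

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1.0f;

    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    // Kernels launched on the same stream execute in launch order, so the
    // intermediate result never has to leave the GPU.
    add_one<<<blocks, threads>>>(dev, n);
    square<<<blocks, threads>>>(dev, n);
    ok = ok && cudaGetLastError() == cudaSuccess;

    // Single copy back; it waits for both kernels to complete.
    ok = ok && cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dev);
    return ok ? host[0] : -1.0f;  // (2 + 1)^2 = 9 on success
}

int main() {
    printf("first element: %f\n", run_chained(1 << 20));
    return 0;
}
```

Compared with copying back after `add_one` and uploading again before `square`, this version saves two transfers per run.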
