cudamalloc slow

s002wjh · November 12, 2015, 6:43pm

I noticed in the Nsight profile that the runtime API call cudaMalloc takes over 200 ms. Is there any function or other way to reduce this?

Robert_Crovella · November 12, 2015, 6:58pm

It’s likely some form of start-up overhead. You can move it around, but you probably can’t eliminate it.

njuffa · November 12, 2015, 7:06pm

Try this: At the start of your application, insert a call to cudaFree(0). This will trigger the lazy creation of the CUDA context and should absorb pretty much the entire startup cost of context creation. Since cudaMalloc() is often the first API call that triggers CUDA context creation, the execution time of the first cudaMalloc() will appear exaggerated because it includes the context creation time.
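A minimal sketch of how one might check this, assuming a single GPU and the CUDA runtime API. The helper names warm_up_context() and timed_malloc_ms() are made up for illustration and are not part of CUDA:

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: force lazy context creation before any real work.
// cudaFree(0) frees nothing; it is used only for its initialization side effect.
static cudaError_t warm_up_context() {
    return cudaFree(0);
}

// Hypothetical helper: host wall-clock time of a single cudaMalloc() call in
// milliseconds, or a negative value if the allocation failed.
static double timed_malloc_ms(size_t bytes) {
    void* p = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&p, bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    cudaFree(p);  // release the test allocation; not included in the timing
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling timed_malloc_ms(1 << 20) twice in a fresh process, once without and once with a preceding warm_up_context(), should show where the start-up cost lands. The actual numbers depend on the driver, GPU, and OS, so treat this as a measurement aid rather than a fix.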

In general, when programming high-performance code, it is a best practice to keep dynamic allocation and de-allocation to a minimum. This applies to cudaMalloc() just as it applies to malloc() on the host. Depending on your use case, you could try re-using existing allocations, for example.
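One way re-use could look, as a rough sketch: keep a scratch buffer around and call cudaMalloc() only when a request exceeds its current capacity. DeviceScratch, reserve(), and release() are hypothetical names for illustration, not CUDA APIs:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical grow-only device scratch buffer.
struct DeviceScratch {
    void*  ptr = nullptr;
    size_t capacity = 0;  // bytes currently allocated
};

// Returns a device pointer with room for at least `bytes`, or nullptr on failure.
// Reallocates only when the request exceeds the current capacity.
// Note: previous contents are discarded when the buffer grows.
static void* reserve(DeviceScratch& s, size_t bytes) {
    if (bytes <= s.capacity) {
        return s.ptr;  // fast path: no allocation call at all
    }
    cudaFree(s.ptr);   // freeing a null pointer is a no-op
    s.ptr = nullptr;
    s.capacity = 0;
    if (cudaMalloc(&s.ptr, bytes) != cudaSuccess) {
        s.ptr = nullptr;
        return nullptr;
    }
    s.capacity = bytes;
    return s.ptr;
}

// Frees the buffer; call once at shutdown.
static void release(DeviceScratch& s) {
    cudaFree(s.ptr);
    s.ptr = nullptr;
    s.capacity = 0;
}
```

With this pattern, a loop that processes batches of varying size pays for cudaMalloc() only on the largest batches seen so far. The trade-off is that the memory stays reserved between uses, which matters on GPUs with little memory.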

s002wjh · November 13, 2015, 6:13pm

Thanks. What about cudaMallocHost? Is there any way to reduce that?

njuffa · November 13, 2015, 6:31pm

cudaMallocHost() is a thin wrapper around OS API calls (on Linux, mmap() if I recall correctly). So if you want faster cudaMallocHost(), check with the OS provider of your choice how those APIs can be made fast. Generally, the speed of host-side allocation may be influenced by the size of allocations as well as how many threads are hitting the allocator simultaneously, although I would hope that the “one giant global lock” allocator style has gone the way of the dodo (i.e. become extinct).

Again, as with all dynamic allocations my advice is to minimize calls to allocation APIs. Ideally you might want to incur this cost once, at program startup.
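As a sketch of paying the cost once at startup: allocate one pinned staging buffer and one device buffer up front, then reuse both for every transfer. The Staging struct and its functions are hypothetical names, and the buffer size is whatever upper bound fits your application:

```cuda
#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical pair of buffers allocated once and reused for all uploads.
struct Staging {
    void*  host  = nullptr;  // pinned host memory from cudaMallocHost()
    void*  dev   = nullptr;  // device memory from cudaMalloc()
    size_t bytes = 0;        // size of each buffer
};

// Call once at program startup; this is where the allocation cost is paid.
static bool staging_init(Staging& st, size_t bytes) {
    if (cudaMallocHost(&st.host, bytes) != cudaSuccess) {
        st.host = nullptr;
        return false;
    }
    if (cudaMalloc(&st.dev, bytes) != cudaSuccess) {
        cudaFreeHost(st.host);
        st.host = nullptr;
        st.dev = nullptr;
        return false;
    }
    st.bytes = bytes;
    return true;
}

// Copies `n` bytes from ordinary host memory to the device via the pinned buffer.
static bool upload(Staging& st, const void* src, size_t n) {
    if (n > st.bytes) return false;  // request does not fit the staging buffer
    std::memcpy(st.host, src, n);
    return cudaMemcpy(st.dev, st.host, n, cudaMemcpyHostToDevice) == cudaSuccess;
}

// Call once at shutdown.
static void staging_destroy(Staging& st) {
    cudaFreeHost(st.host);
    cudaFree(st.dev);
    st = Staging{};
}
```

If the application can produce its data directly in st.host, the memcpy() step can be dropped entirely. Whether the up-front allocation is a net win depends on how many transfers follow it.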

LongY · November 13, 2015, 6:53pm

cudaMallocHost allocates pinned (page-locked) memory on the host. It avoids the extra copy from pageable memory into pinned memory, as described in the pinned host memory section of this link: [url]http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/[/url]. The maximum size of pinned memory depends on the host system. I remember a thread discussing the maximum size of pinned memory, but I forgot the exact number.

As for cudaMalloc, I faced the same situation. cudaMalloc takes a significant amount of time in a generic vector addition program. As txbob and njuffa pointed out, the execution time of the first cudaMalloc() includes the context creation time. My strategy was also to reuse allocations as much as possible on the GPU, given that GPU memory is limited, e.g., 1GB on the Nvidia NVS 315.
