In the Nsight profile, I notice that the runtime API call cudaMalloc() takes over 200 ms. Is there any function or other way to reduce this?
It’s likely some form of startup overhead. You can move it around, but you probably can’t eliminate it.
Try this: At the start of your application, insert a call to cudaFree(0). This will trigger the lazy creation of the CUDA context and should absorb pretty much the entire startup cost of context creation. Since cudaMalloc() is often the first API call that triggers CUDA context creation, the execution time of the first cudaMalloc() will appear exaggerated because it includes the context creation time.
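To see the effect in isolation, here is a minimal sketch that times cudaFree(0) and a subsequent cudaMalloc() separately. The helper names warm_up_context_ms and timed_malloc_ms are made up for illustration, and measuring wall-clock time with std::chrono is just one assumed way to do it; actual numbers will vary with the system and driver.

```cuda
#include <chrono>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: forces lazy CUDA context creation and returns
// the wall-clock time it took, in milliseconds.
double warm_up_context_ms() {
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);  // frees nothing; its only effect here is to initialize the context
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: times a single cudaMalloc() of `bytes` bytes.
// Returns the elapsed milliseconds, or -1.0 if the allocation failed.
double timed_malloc_ms(size_t bytes) {
    void* d_ptr = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&d_ptr, bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) return -1.0;
    cudaFree(d_ptr);  // release the test allocation (not part of the timing)
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

If you call warm_up_context_ms() once at the top of main() and then timed_malloc_ms(1 << 20), most of the delay should show up in the first call rather than in cudaMalloc().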
In general, when programming high-performance code, it is best practice to keep dynamic allocation and de-allocation to a minimum. This applies to cudaMalloc() just as it applies to malloc() on the host. Depending on your use case, you could try re-using existing allocations, for example.
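One way to re-use an allocation is a buffer that only calls cudaMalloc() when it has to grow. The following is a rough sketch, not a full memory pool; ReusableDeviceBuffer and its members are hypothetical names, and it assumes the old contents do not need to survive a resize.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: a device buffer that is reallocated only when a
// request exceeds its current capacity. Contents are NOT preserved on growth.
struct ReusableDeviceBuffer {
    void*  ptr = nullptr;
    size_t capacity = 0;     // bytes currently allocated
    int    allocations = 0;  // number of successful cudaMalloc() calls

    // Returns a pointer to at least `bytes` bytes of device memory,
    // or nullptr if a needed allocation failed.
    void* reserve(size_t bytes) {
        if (bytes <= capacity) return ptr;  // existing allocation is big enough
        cudaFree(ptr);                      // no-op when ptr is nullptr
        ptr = nullptr;
        capacity = 0;
        if (cudaMalloc(&ptr, bytes) != cudaSuccess) {
            ptr = nullptr;
            return nullptr;
        }
        capacity = bytes;
        ++allocations;
        return ptr;
    }

    // Frees the device memory; the buffer can be reserved again afterwards.
    void release() {
        cudaFree(ptr);
        ptr = nullptr;
        capacity = 0;
    }
};
```

Calling reserve() with the largest size you expect once at startup means later calls with smaller sizes return the same pointer without touching the allocator.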
Thanks. What about cudaMallocHost()? Is there any way to reduce that?
cudaMallocHost() is a thin wrapper around OS API calls (on Linux, mmap() if I recall correctly). So if you want faster cudaMallocHost(), check with the OS provider of your choice how those APIs can be made fast. Generally, the speed of host-side allocation may be influenced by the size of allocations as well as how many threads are hitting the allocator simultaneously, although I would hope that the “one giant global lock” allocator style has gone the way of the dodo (i.e. become extinct).
Again, as with all dynamic allocations, my advice is to minimize calls to allocation APIs. Ideally, you would incur this cost once, at program startup.
cudaMallocHost() allocates pinned (page-locked) memory on the host. This avoids the extra copy from pageable memory into a pinned staging buffer, as described in the section on pinned host memory at [url]http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/[/url]. The maximum amount of pinned memory depends on the host system. I remember a thread discussing the maximum size of pinned memory, but I forgot the exact number.
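As a sketch of how the pinned allocation cost can be paid once and then amortized, the function below allocates one pinned host buffer and one device buffer, then re-uses both for every transfer. The name upload_repeatedly and the dummy data written into the buffer are assumptions for illustration only.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: performs `iterations` host-to-device uploads of `n`
// floats, re-using a single pinned host buffer and a single device buffer.
// Returns the number of uploads that completed successfully.
int upload_repeatedly(size_t n, int iterations) {
    const size_t bytes = n * sizeof(float);
    float* h_pinned = nullptr;
    float* d_buf = nullptr;

    // Pay the allocation cost once, before the loop.
    if (cudaMallocHost((void**)&h_pinned, bytes) != cudaSuccess) return 0;
    if (cudaMalloc((void**)&d_buf, bytes) != cudaSuccess) {
        cudaFreeHost(h_pinned);
        return 0;
    }

    int done = 0;
    for (int it = 0; it < iterations; ++it) {
        // Produce the data directly in pinned memory (dummy values here),
        // so no pageable-to-pinned staging copy is needed.
        for (size_t i = 0; i < n; ++i) h_pinned[i] = static_cast<float>(it);
        if (cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice) != cudaSuccess) break;
        ++done;
    }

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return done;
}
```

For example, upload_repeatedly(1 << 20, 100) performs 100 transfers with only one cudaMallocHost() and one cudaMalloc() call in total.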
As for cudaMalloc(), I faced the same situation as you. cudaMalloc() took a significant amount of time in a generic vector addition kernel. As txbob and njuffa pointed out, the execution time of the first cudaMalloc() includes the context creation time. My strategy was also to reuse allocations on the GPU as much as possible, given that GPU memory is limited, e.g., 1 GB on an NVIDIA NVS 315.