Hello everyone! Is there any function or way to allocate device memory asynchronously? Here is the background: I want to allocate a large amount of memory (0.1 GB to 1 GB), which takes noticeable time compared to the rest of my algorithm. If cudaMalloc could be executed asynchronously, i.e., return right after it is called so the host can run subsequent code while the device allocates the memory, the allocation latency would be hidden. This is the same behavior as cudaMemcpyAsync() or cudaMemsetAsync().
If you are allocating memory many times in your algorithm, implement a simple memory management scheme: allocate one huge buffer once, and then hand out suitably aligned pointers into it whenever memory is requested, as in the sketch below.
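A minimal sketch of such a scheme, assuming a single up-front cudaMalloc and a simple bump-pointer suballocator with 256-byte alignment (the class name DevicePool and the sizes are illustrative, not a library API):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Illustrative bump-pointer pool: one big cudaMalloc at startup,
// then hand out aligned sub-ranges with no further driver calls.
class DevicePool {
public:
    explicit DevicePool(size_t bytes) : capacity_(bytes) {
        cudaMalloc(&base_, capacity_);          // one-time, synchronous allocation
    }
    ~DevicePool() { cudaFree(base_); }

    // Returns a device pointer aligned to `alignment` bytes,
    // or nullptr if the pool is exhausted.
    void* allocate(size_t bytes, size_t alignment = 256) {
        size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned + bytes > capacity_) return nullptr;
        offset_ = aligned + bytes;
        return static_cast<char*>(base_) + aligned;
    }

    void reset() { offset_ = 0; }               // reuse the whole pool next iteration

private:
    void*  base_     = nullptr;
    size_t capacity_ = 0;
    size_t offset_   = 0;
};

int main() {
    DevicePool pool(1u << 30);                  // e.g. reserve 1 GiB once at startup
    float* d_a = static_cast<float*>(pool.allocate(256u * 1024 * 1024));
    float* d_b = static_cast<float*>(pool.allocate(128u * 1024 * 1024));
    printf("d_a=%p d_b=%p\n", (void*)d_a, (void*)d_b);
    // ... launch kernels using d_a / d_b ...
    pool.reset();                               // "frees" every sub-allocation in O(1)
    return 0;
}
```

This way the expensive cudaMalloc happens only once, outside the timed part of the algorithm, and each subsequent "allocation" is just pointer arithmetic on the host, so nothing needs to be hidden asynchronously.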