Asynchronous cudaMalloc

Hello everyone! Is there any function or other way to allocate device memory asynchronously? Here is the background: I want to allocate a large amount of memory (0.1 GB to 1 GB), which takes noticeable time compared to the rest of my algorithm. If cudaMalloc could be executed asynchronously, i.e., return right after it is called so the host can run subsequent code while the device allocates the memory, the allocation delay would be hidden. This would behave just like cudaMemcpyAsync or cudaMemsetAsync.

Any suggestions?

What about using multiple threads? I'd go with OpenMP sections for the simplest solution.

#pragma omp parallel sections num_threads(2)
{
  #pragma omp section
  {
    // allocate memory here
  }
  #pragma omp section
  {
    // do the other task here
  }
}

Well, in my situation the allocated space must be contiguous, i.e., I want to speed up this line:

cudaMalloc(ptr, size);

The ideal substitute would look like:

cudaMallocAsync(ptr, size, [stream]);

// ... some CPU work, e.g., host memory allocation ...

// cudaMalloc done; continue with the CUDA computation.

On the other hand, I don't think multi-threaded allocation would speed things up, since the bottleneck is presumably device memory speed (just a guess).
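An aside for readers on newer toolkits: CUDA 11.2 introduced a stream-ordered allocator, cudaMallocAsync/cudaFreeAsync, with essentially the signature asked for above. A minimal sketch (it needs a GPU and a device that supports memory pools, so treat it as illustrative rather than tested):

```c
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = 512u << 20;   // e.g. a 0.5 GB device buffer
    void *d_buf = NULL;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMallocAsync(&d_buf, bytes, stream);   // returns immediately; the
                                              // allocation is queued on stream
    // ... other CPU work here (host allocations, setup, I/O) ...
    cudaMemsetAsync(d_buf, 0, bytes, stream); // stream-ordered after the alloc
    cudaFreeAsync(d_buf, stream);
    cudaStreamSynchronize(stream);
    return 0;
}
```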

If your algorithm performs memory allocations many times, implement a simple memory management system: allocate one huge buffer once, and when memory is requested, hand out pointers to appropriate (aligned) locations within it.