Asynchronous cudaMalloc

Hello everyone! Is there any function or other way to allocate device memory asynchronously? Here is the background: I want to allocate a large amount of memory (0.1 GB to 1 GB), which takes noticeable time compared to the rest of my algorithm. If cudaMalloc could be executed asynchronously, i.e., return right after it is called so the host can run subsequent code while the device allocates the memory, the allocation delay would be hidden. This would behave just like cudaMemcpyAsync or cudaMemsetAsync.

Any suggestions?

What about using multiple threads? I'd go with OpenMP sections for the simplest solution.

#pragma omp parallel sections num_threads(2)
{
  #pragma omp section
  {
    // allocate memory here
  }
  #pragma omp section
  {
    // do the other task here
  }
}

Well, in my situation the allocated space must be contiguous, i.e., I want to speed up this line:

cudaMalloc(ptr, size);

The ideal substitute would look like:

cudaMallocAsync(ptr, size, [stream]);

// ... some CPU work, e.g., host memory allocation ...

// cudaMalloc done; continue with the CUDA computation.

On the other hand, I don't think multi-threaded allocation would speed things up, since the bottleneck is presumably device memory speed (just a guess).
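An aside for readers on newer toolkits: CUDA 11.2 introduced a stream-ordered allocator, cudaMallocAsync/cudaFreeAsync, with essentially the signature asked for above. A minimal sketch (it needs a GPU and a device that supports memory pools, so treat it as illustrative rather than tested):

```c
#include <cuda_runtime.h>

int main(void)
{
    size_t bytes = 512u << 20;   // e.g. a 0.5 GB device buffer
    void *d_buf = NULL;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMallocAsync(&d_buf, bytes, stream);   // returns immediately; the
                                              // allocation is queued on stream
    // ... other CPU work here (host allocations, setup, I/O) ...
    cudaMemsetAsync(d_buf, 0, bytes, stream); // stream-ordered after the alloc
    cudaFreeAsync(d_buf, stream);
    cudaStreamSynchronize(stream);
    return 0;
}
```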

If your algorithm performs memory allocations many times, implement a simple memory management system: allocate one huge buffer once, and when memory is requested, hand out pointers to appropriate (aligned) locations within it.