Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2

Originally published at:

In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also…


This does improve performance somewhat, but I have a question.
Is it possible to pre-allocate one large chunk of device memory and then assign values directly into that chunk? Would that perform better? Here is part of the code:

float* in_d;
float* in_d_sin[10];
cudaMalloc((void**)&in_d, 10 * sizeof(float)); // one chunk holding 10 floats
for (int i = 0; i < 10; i++) {
    in_d_sin[i] = in_d + i; // sub-pointers into the chunk
}

Hey @bjhd_qcj, could you elaborate a bit more on your question?

I’m not sure how it relates to cudaMallocAsync or performance.

Your for loop itself only does pointer arithmetic on the host, which is fine. But if you then dereference those sub-pointers from host code to assign values, it won't work, because the memory was allocated with cudaMalloc. For that you would need to allocate with cudaMallocManaged, or copy the values over with cudaMemcpy.
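To make the distinction concrete, here is a minimal sketch (not from the thread; the values and sizes are made up for illustration) of populating a pre-allocated chunk with cudaMemcpy instead of dereferencing the device sub-pointers on the host:

```cuda
#include <cuda_runtime.h>

int main() {
    const int N = 10;
    float* in_d = nullptr;
    float* in_d_sin[N];   // host-side array holding device sub-pointers
    float  host_vals[N];
    for (int i = 0; i < N; i++) host_vals[i] = (float)i;

    cudaMalloc((void**)&in_d, N * sizeof(float)); // one chunk for all N floats
    for (int i = 0; i < N; i++) {
        in_d_sin[i] = in_d + i; // pointer arithmetic on the host is legal
    }

    // Copy the values into the chunk. Host code must not write through
    // in_d_sin[i] directly unless the memory came from cudaMallocManaged.
    cudaMemcpy(in_d, host_vals, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(in_d);
    return 0;
}
```

The sub-pointers in_d_sin[i] can then be passed to kernels, which are free to dereference them on the device.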

When a model is hot-loaded or unloaded, the service experiences performance jitter for several minutes.
Our initial analysis is that the jitter is caused by slow device-memory allocation, because device-memory usage grows slowly during the jitter.
On further analysis, our model calls cudaMalloc about 3000 times when loading, and we suspect this is the cause. So we want to reduce the number of cudaMalloc calls.

Sounds like you should try out cudaMallocAsync and see if it makes a difference!
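For that workload, a rough sketch of what trying cudaMallocAsync could look like (assuming CUDA 11.2 or later; the allocation count and sizes below are stand-ins for the ~3000 per-weight allocations described above): raising the default pool's release threshold keeps freed memory cached in the pool across load/unload cycles instead of returning it to the OS, so repeated loads avoid the expensive OS-level allocations.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Raise the default pool's release threshold so memory freed by
    // cudaFreeAsync stays cached in the pool (the default threshold is 0,
    // which releases everything back to the OS at synchronization points).
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    uint64_t threshold = UINT64_MAX;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    // Stand-in for the many small allocations made during model loading.
    for (int i = 0; i < 3000; i++) {
        void* p = nullptr;
        cudaMallocAsync(&p, 1 << 16, stream); // stream-ordered allocation
        // ... launch kernels that use p on the same stream ...
        cudaFreeAsync(p, stream);             // memory returns to the pool
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```

After the first load warms the pool, subsequent allocations of similar sizes are serviced from cached pool memory, which is what should shrink the multi-minute jitter.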