Originally published at: https://developer.nvidia.com/blog/using-the-nvidia-cuda-stream-ordered-memory-allocator-part-2/
In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also…
This does improve performance somewhat, but there is a question: is it possible to pre-allocate one large chunk of device memory and then assign values directly into that chunk? Would that perform better? For example:
float* in_d;
float* in_d_sin[10];
cudaMalloc((void**)&in_d, sizeof(float) * 10);  // one allocation holding 10 floats
for (int i = 0; i < 10; i++) {
    in_d_sin[i] = in_d + i;  // host-side pointers into the device allocation
}
Hey @bjhd_qcj, could you elaborate a bit more on your question? I’m not sure how it relates to cudaMallocAsync or performance.

Your code as written wouldn’t work because you’re allocating device memory with cudaMalloc and then attempting to write to it from host code in the for loop. For that to work you would need to use cudaMallocManaged.
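For illustration, a minimal sketch of the cudaMallocManaged variant being suggested (the values written are placeholders, not from the original post, and this assumes no kernel is using the buffer during the host writes):

#include <cuda_runtime.h>

int main() {
    float* in_d;
    // Managed memory is accessible from both host and device code.
    cudaMallocManaged((void**)&in_d, sizeof(float) * 10);
    for (int i = 0; i < 10; i++) {
        in_d[i] = (float)i;  // host code can write directly into the allocation
    }
    cudaDeviceSynchronize();  // ensure no kernel touches the buffer concurrently
    cudaFree(in_d);
    return 0;
}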
When the model is hot-loaded or unloaded, the service experiences performance jitter for several minutes. Our initial analysis is that this is caused by slow device memory allocation, because device memory usage grows slowly during the jitter. Further analysis showed that our model calls cudaMalloc about 3,000 times when loading, so we suspect this is the cause and want to reduce the number of cudaMalloc calls.
Sounds like you should try out cudaMallocAsync and see if it makes a difference!
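As a rough sketch of what that switch might look like (loadOneBuffer and bytes are placeholder names, not from the thread; the real loader would enqueue its kernels between the allocation and the free):

#include <cuda_runtime.h>

// Sketch: replace each cudaMalloc/cudaFree pair with the stream-ordered
// allocator, so repeated allocations are served from the device memory pool
// rather than paying the full allocation cost every time.
void loadOneBuffer(cudaStream_t stream, size_t bytes) {
    void* buf = nullptr;
    cudaMallocAsync(&buf, bytes, stream);  // allocation is ordered in the stream
    // ... launch kernels that use buf on the same stream ...
    cudaFreeAsync(buf, stream);            // memory returns to the pool in stream order
}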
Hello, I am trying to understand the backward compatibility of the stream-ordered memory allocation introduced in CUDA 11.2 with older GPUs. How did you manage to test the feature on V100 GPUs with compute capability 7.0? The GPUs I tested (RTX 4000 and V100), with compute capability older than 8.6 (maybe 8.0 is enough), seem not to support the feature even with drivers supporting CUDA 11.2. Can you explain how I can tell whether a GPU is compatible with a new feature without checking at runtime? Does the Ampere A100 GPU, released with CUDA 11.1, support new CUDA 11.2 features? Thank you.
cudaMallocAsync only depends on the version of the driver you are using. If you have the correct driver version, it should work on any GPU still supported by the CUDA Toolkit.
seem to not support the feature even with drivers supporting CUDA 11.2
What did you see that made it seem like cudaMallocAsync wasn’t supported?
I wrote a test program, and here are the results I get on a Windows 2019 server with the latest driver, version 537.13:
Device Number: 0
Driver / Runtime versions: 12.2 / 11.8
Device Name: Quadro RTX 4000
Compute Capability: 7.5
Memory Pools Supported: false
cudaMallocAsync: operation not supported
cudaFreeAsync: operation not supported
It seems that even with the latest driver, the cudaDevAttrMemoryPoolsSupported attribute is 0 and I get an “operation not supported” error when allocating memory on the RTX 4000.
Running the same test on a 8.6 compute capability GPU using older drivers works well:
Device Number: 0
Driver / Runtime versions: 12.0 / 11.8
Device Name: NVIDIA RTX A2000 Laptop GPU
Compute Capability: 8.6
Memory Pools Supported: true
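For reference, a runtime check along these lines (a sketch, not the poster's actual test program) is enough to detect whether the stream-ordered allocator is usable on a given device:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    int poolsSupported = 0;
    // cudaDevAttrMemoryPoolsSupported is 1 when cudaMallocAsync is available.
    cudaDeviceGetAttribute(&poolsSupported, cudaDevAttrMemoryPoolsSupported, device);
    printf("Memory Pools Supported: %s\n", poolsSupported ? "true" : "false");

    if (poolsSupported) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        void* p = nullptr;
        cudaError_t err = cudaMallocAsync(&p, 1024, stream);
        printf("cudaMallocAsync: %s\n", cudaGetErrorString(err));
        if (err == cudaSuccess) {
            printf("cudaFreeAsync: %s\n", cudaGetErrorString(cudaFreeAsync(p, stream)));
        }
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }
    return 0;
}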
Is the Quadro RTX 4000 running in TCC mode?
Yes indeed, is this the limitation?
Yes. The feature is not supported on TCC.
Ok, thank you for the answer. Do you plan to support it in the near future?
Yes, it will be supported, but the exact timeline is still TBD.