Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2

Originally published at:

In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also…


This does improve performance somewhat, but I have a question.
Is it possible to pre-allocate one large chunk of device memory and then assign values directly into that chunk? Would that perform better? Here is part of the code:

float* in_d;
float* in_d_sin[10];
cudaMalloc((void**)&in_d, 10 * sizeof(float)); // one chunk holding 10 floats
for (int i = 0; i < 10; i++) {
    in_d_sin[i] = in_d + i; // sub-pointers into the chunk
}

Hey @bjhd_qcj, could you elaborate a bit more on your question?

I’m not sure how it relates to cudaMallocAsync or performance.

Your for loop itself only does pointer arithmetic on the host, which is fine. But if you then dereference those sub-pointers from host code to assign values, it won't work, because the memory was allocated with cudaMalloc. For that you would need to allocate with cudaMallocManaged, or copy the values over with cudaMemcpy.
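To make the distinction concrete, here is a minimal sketch (not from the thread; the values and sizes are made up for illustration) of populating a pre-allocated chunk with cudaMemcpy instead of dereferencing the device sub-pointers on the host:

```cuda
#include <cuda_runtime.h>

int main() {
    const int N = 10;
    float* in_d = nullptr;
    float* in_d_sin[N];   // host-side array holding device sub-pointers
    float  host_vals[N];
    for (int i = 0; i < N; i++) host_vals[i] = (float)i;

    cudaMalloc((void**)&in_d, N * sizeof(float)); // one chunk for all N floats
    for (int i = 0; i < N; i++) {
        in_d_sin[i] = in_d + i; // pointer arithmetic on the host is legal
    }

    // Copy the values into the chunk. Host code must not write through
    // in_d_sin[i] directly unless the memory came from cudaMallocManaged.
    cudaMemcpy(in_d, host_vals, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(in_d);
    return 0;
}
```

The sub-pointers in_d_sin[i] can then be passed to kernels, which are free to dereference them on the device.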

When a model is hot-loaded or unloaded, the service experiences performance jitter for several minutes.
Our initial analysis is that the jitter is caused by slow device-memory allocation, because device-memory usage grows slowly during the jitter.
On further analysis, our model calls cudaMalloc about 3000 times when loading, and we suspect this is the cause. So we want to reduce the number of cudaMalloc calls.

Sounds like you should try out cudaMallocAsync and see if it makes a difference!
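For that workload, a rough sketch of what trying cudaMallocAsync could look like (assuming CUDA 11.2 or later; the allocation count and sizes below are stand-ins for the ~3000 per-weight allocations described above): raising the default pool's release threshold keeps freed memory cached in the pool across load/unload cycles instead of returning it to the OS, so repeated loads avoid the expensive OS-level allocations.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Raise the default pool's release threshold so memory freed by
    // cudaFreeAsync stays cached in the pool (the default threshold is 0,
    // which releases everything back to the OS at synchronization points).
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);
    uint64_t threshold = UINT64_MAX;
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    // Stand-in for the many small allocations made during model loading.
    for (int i = 0; i < 3000; i++) {
        void* p = nullptr;
        cudaMallocAsync(&p, 1 << 16, stream); // stream-ordered allocation
        // ... launch kernels that use p on the same stream ...
        cudaFreeAsync(p, stream);             // memory returns to the pool
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```

After the first load warms the pool, subsequent allocations of similar sizes are serviced from cached pool memory, which is what should shrink the multi-minute jitter.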