16GB cudaMalloc() on A10 (24GB) takes ~300-400ms after previous cudaFree

Hello,

I am making a 16GB allocation with cudaMalloc() for every inference run of an AI model; the allocation is used for the KV cache. The first time, the allocation takes around 20-30ms on an A10 (24GB). After the inference run I free the memory with cudaFree(). Every subsequent cudaMalloc() of the same 16GB takes between 300ms and 400ms.

This only happens if the subsequent 16GB cudaMalloc() is issued fairly quickly after the cudaFree() (around 50-100ms after it). If I sleep for 1 second after the cudaFree(), the next 16GB cudaMalloc() again takes only 20-30ms. This leads me to suspect that after cudaFree() there is a deferred task which clears/zeroes the huge allocation or performs some other operation related to it.

I tried using cudaMallocAsync(), and then the 300-400ms shifts to the later cudaFree() call. If I also use cudaFreeAsync(), the time moves to the next cudaStreamSynchronize() after the cudaFreeAsync().

I was unable to find this behaviour described anywhere, nor what might be causing it and how to avoid it. Do you have any ideas about the above questions?

Here is the output of nvidia-smi including the Driver version and CUDA version:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     On  | 00000000:06:00.0 Off |                    0 |
|  0%   30C    P8              20W / 150W |     18MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

It is not clear how you are measuring these times. Misattribution of time spent seems like a possibility. Depending on the specifics, cudaFree() can be a pretty slow operation and the time it takes may be misattributed to a following cudaMalloc(). You might want to post standalone repro code for others to try.
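For example, a standalone timing harness along these lines (just a sketch, with a synchronization point before each timed call so that pending work is not charged to the wrong API call) would make the attribution unambiguous. Adding an optional sleep between cudaFree() and the next cudaMalloc() would then show directly whether the extra time migrates between the two calls.

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t bytes = 16ull << 30;   // 16 GiB, as in the original report
    for (int i = 0; i < 5; ++i) {
        void* p = nullptr;

        cudaDeviceSynchronize();        // make sure no earlier work is still pending
        auto t0 = std::chrono::steady_clock::now();
        cudaError_t err = cudaMalloc(&p, bytes);
        auto t1 = std::chrono::steady_clock::now();
        if (err != cudaSuccess) { printf("cudaMalloc failed: %s\n", cudaGetErrorString(err)); return 1; }

        // ... the real application would run its kernels here ...

        cudaDeviceSynchronize();        // so kernel time is not charged to cudaFree()
        auto t2 = std::chrono::steady_clock::now();
        cudaFree(p);
        auto t3 = std::chrono::steady_clock::now();

        printf("iter %d: cudaMalloc %.1f ms, cudaFree %.1f ms\n", i,
               std::chrono::duration<double, std::milli>(t1 - t0).count(),
               std::chrono::duration<double, std::milli>(t3 - t2).count());
    }
    return 0;
}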

While it is theoretically possible that deallocation includes deferred work, I cannot think of a practical scenario where this kind of design would make sense. Reducing the cost of deallocations at the possible expense of allocations does not seem like a rational design choice, as allocations are more typically on a time-critical path. I am not aware of documentation describing any kind of deferred actions triggered by cudaFree().

Both GPU allocations and deallocations are typically quite slow, and one would want to re-use existing allocations as much as feasible instead of repeatedly allocating and deallocating.
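For example (a sketch; more_requests() and run_inference() are stand-ins for whatever the application actually does):

// Pay the 16GB allocation cost once and reuse the buffer for every run.
void* kv_cache = nullptr;
const size_t kv_bytes = 16ull << 30;
cudaMalloc(&kv_cache, kv_bytes);            // once, at startup
while (more_requests()) {
    run_inference(kv_cache, kv_bytes);      // reuse the same allocation
}
cudaFree(kv_cache);                         // once, at shutdown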

Yes, I believe that on newer versions of CUDA (this functionality did not exist on CUDA day 1, 10+ years ago) cudaFree will somehow scrub (slide 18) device memory that was deallocated, at some point prior to the next allocation request that would use that memory. Certainly the obvious way to avoid it, and the one already mentioned, is to reuse your allocation(s).

Thanks for the pointer. Good to know. Suggestion to NVIDIA: don’t hide that kind of information deep in a slide deck. I understand that one does not want to telegraph implementation details to the world, but secrecy at this level becomes counterproductive.

[Later:] I have now actually looked at slide 18 in the linked presentation. I don’t see where it says that part of the work performed by cudaFree() is deferred. Scrubbing of de-allocated memory has been used by CUDA for a long time. My recollection is very hazy but I think that was part of the work on unified memory. I am pretty sure it already existed when I left NVIDIA in 2014.

However, that only explains why cudaFree() itself is slow, something that has been true for a long time, too. It does not readily imply that part of that work is done lazily and time spent on it therefore can accrue to subsequent cudaMalloc() calls. I checked the following slides but did not spot anything about scrubbing happening in lazy fashion.

If deallocation triggers deferred scrubbing activity, I would suggest clarifying that in the Best Practices Guide.

Thank you for the info @Robert_Crovella. My intuition also pointed to such memory scrubbing, mainly for security reasons, as I was recently reading about the LeftoverLocals vulnerability (not reproducible on NVIDIA hardware and only related to local GPU memory, but still).

However, I hoped that there would be a setting I could provide to prevent the scrubbing. Additionally, I thought that requesting the memory through cudaMalloc() from the same process, and even the same thread, might avoid the scrubbing, since the memory generally does not have to be scrubbed to guard against reads from the same process.

I can still put together an isolated example if that would be useful. Reusing the allocation is obviously easy to do, but when the overhead of allocating and freeing every time is small enough compared to the other computation done in the program, it can lead to simpler and cleaner API/code.

Did you try setting the release threshold of the memory pool backing cudaMallocAsync / cudaFreeAsync to the maximum? With the maximum threshold, freed pool allocations should only be returned to the OS when required (for example when the process exits) or when you explicitly request it. It should no longer happen on device/stream synchronization.

Thanks for the tip @striker159. I just tried setting the release threshold of the default memory pool to the maximum and it indeed helped: I no longer observe the ~300-400ms spikes.
For reference → I used code similar to this to do so:

#include <cuda_runtime.h>
#include <cstdint>

int device = 0;  // the device the allocations are made on
cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, device);
uint64_t threshold = UINT64_MAX;  // never release retained pool memory back to the OS
cudaMemPoolSetAttribute(mempool, cudaMemPoolAttrReleaseThreshold, &threshold);

I think this setting should be the default for use cases where a single process is known to be the main user of the GPU. I found it mentioned as a guideline here. My assumption is that since we no longer release the memory to the OS, we avoid the scrubbing as well as the obvious large overhead of allocating memory from the OS again later, and instead simply reuse the memory from the device’s memory pool.
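For completeness, here is roughly what the per-run allocation looks like once the pool retains its memory (a sketch; it assumes the allocation itself goes through the stream-ordered allocator, i.e. cudaMallocAsync()/cudaFreeAsync() on some stream, so that it is served from the retained pool rather than from the OS):

cudaStream_t stream;
cudaStreamCreate(&stream);

void* kv_cache = nullptr;
cudaMallocAsync(&kv_cache, 16ull << 30, stream);  // reuses memory kept in the pool
// ... enqueue the inference work on `stream` ...
cudaFreeAsync(kv_cache, stream);                  // memory returns to the pool, not to the OS
cudaStreamSynchronize(stream);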
