I understand that the CUDA context takes GPU memory, but sometimes the cost is still surprising.
I’m measuring the cost of the CUDA context through PyTorch:
import torch

def measure_current_non_torch():
    free, total = torch.cuda.mem_get_info()
    current_used = total - free
    current_torch = torch.cuda.memory_reserved()
    current_non_torch = current_used - current_torch
    return current_non_torch

torch.cuda.init()

# print in MiB
print(measure_current_non_torch() / 1024 / 1024)

# call matmul with 4096 x 4096 matrices
a = torch.randn(4096, 4096).cuda()
b = torch.randn(4096, 4096).cuda()
c = torch.matmul(a, b)

print(measure_current_non_torch() / 1024 / 1024)
Run with python test.py:
529.0625
597.0625
Run with CUDA_MODULE_LOADING=EAGER python test.py:
1179.0625
1805.0625
It is surprising that the increase after a single matmul can be more than 600 MiB in the eager-loading run.
In general, how can we get a rough idea of the amount of memory taken by the CUDA context? Which kinds of memory cost are persistent, and which are transient?
In addition, how can I get the current memory cost of the CUDA context?
To be clear, I don’t need to predict the cost of the CUDA context; I just want to know the current cost, so that I can monitor the rest of the memory usage.
In CUDA C++, which is the primary focus of this forum, the usual API call that people use is cudaMemGetInfo.
This provides both the total and free memory available on the GPU. If you run this at the beginning of your code, the difference of those two is probably largely the context overhead.
The big exception here would be if you are running on a display GPU. In that case, the display functions could be using memory, which would cloud your ability to see what the context overhead is.
Likewise, as your code proceeds, you can call this API call again to get updates on memory usage.
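If you want to make the same call from Python without compiling anything, a minimal sketch (assuming the cuda-python runtime bindings are installed; the display-GPU caveat above still applies) might look like this:

from cuda import cudart

# in these bindings, cudaMemGetInfo returns (error_code, free_bytes, total_bytes);
# calling a runtime API function also initializes the CUDA context if none exists yet
err, free, total = cudart.cudaMemGetInfo()
assert err == cudart.cudaError_t.cudaSuccess

# the difference is everything in use: context overhead plus all allocations
used_mib = (total - free) / 1024 / 1024
print(f"used (context + all allocations): {used_mib:.1f} MiB")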
If you need a PyTorch-specific answer, I suggest posting on discuss.pytorch.org. There are NVIDIA experts there.
Thanks for the quick response! I know about this, but the context overhead seems to vary during program execution, and I want to know the up-to-date memory overhead of the context.
I don’t think this is PyTorch-specific: torch.cuda.mem_get_info() basically calls cudaMemGetInfo; I just don’t want to have to compile C source code.
There isn’t a function to check the “context overhead” that I know of. You would have to keep track of all allocations made in the context, and “assume” that the difference between those and the actual reported memory usage is context overhead. Since various libraries like CUBLAS, etc. may make “temporary” allocations, this would be hard to track perfectly in the general case.
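In the OP’s Python setup, that bookkeeping might look like the sketch below. The tracked_non_torch_bytes tally is hypothetical: you would have to maintain it yourself by wrapping every non-PyTorch allocation and free you know about, and anything you miss ends up attributed to “context overhead”.

import torch

# hypothetical running total of the non-PyTorch allocations you track yourself
# (e.g. NCCL buffers, cuBLAS workspaces); update it on every known alloc/free
tracked_non_torch_bytes = 0

def estimate_context_overhead():
    free, total = torch.cuda.mem_get_info()
    used = total - free
    # whatever is neither tracked by PyTorch nor by your own bookkeeping
    # is attributed to the CUDA context (plus anything you failed to track)
    return used - torch.cuda.memory_reserved() - tracked_non_torch_bytes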
cudaMemGetInfo returns the free and total memory, and the difference is the memory in use. However, that used memory includes both the memory of the CUDA context itself and the memory allocated from within the context.
To be clear, I’m interested in the memory used by the CUDA context itself, such as the memory taken to load CUDA modules, textures, CUDA graph nodes, etc. I want to separate the memory usage of the CUDA context itself from the memory allocated by cudaMalloc/cuMem*, cudaMallocAsync, etc.
What is the use case that requires this data? Are there alternative solutions that can accomplish the same goal?
I am a bit puzzled regarding the necessity of such a feature. There are many thousands of CUDA-accelerated apps, and somehow they manage to work with the functionality available today. What makes this specific use case different?
For what it is worth, you could always file a feature request with NVIDIA. By historical observation, the company is generally responsive to customer requests; that is in large part how the CUDA ecosystem has grown. But this works best when many customers are asking for the same feature.
My use case is: I want to track all the allocations outside of PyTorch and the CUDA context, such as NCCL / cuBLAS memory usage.
PyTorch can give me all the memory it allocates, via torch.cuda.memory_reserved(). I can get all of the used memory on a device via cudaMemGetInfo. However, since the CUDA context takes a varying amount of memory and I don’t have a clean way to track it, I cannot easily get the memory cost outside PyTorch.
I’m implementing a feature in LLM inference where users can put the inference engine into sleep mode and wake it up later.
In sleep mode, the inference engine should take as little memory as possible.
For PyTorch’s memory allocations, I can control and offload them. But there is still some memory allocated outside PyTorch, and I need to investigate whether it can be offloaded as well.
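One workaround sketch, building on measure_current_non_torch() from the first post: snapshot the non-torch usage once after warm-up and treat later growth as memory allocated outside PyTorch. This assumes the context overhead itself stays roughly constant after warm-up, which the eager-vs-lazy numbers above show is not guaranteed, so it is only an estimate.

# taken once, after the context and the libraries in use have been initialized
baseline_non_torch = measure_current_non_torch()

def estimate_outside_pytorch():
    # growth of non-torch memory since the baseline snapshot;
    # any later growth of the context itself is misattributed here
    return measure_current_non_torch() - baseline_non_torch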
The usual way to request extensions to CUDA (e.g. new functions/APIs) is to file a bug. You can provide a link in the bug to this forum post if you think it is important. The better justification you can give, the more likely it is for the request to receive some priority. Things that can already be done using an alternate method may receive lower priority.