Global memory usage profiling and tracking

When I run nvidia-smi while a process is running, it shows more global memory in use than I have allocated.
My compute-sanitizer runs are clean.
I would like to know, while the application is running, when and what is being allocated and when it is freed. I thought the nvprof driver API trace would show cudaMalloc, but it did not.
I ran with nvprof --track-memory-allocations on.
How would I find this information?
I am using a Tesla V100, which only supports nvprof and not Nsight Systems.
The CUDA version is 12.0.

My GPU code uses quite a few templates for the CUDA kernels, and I am wondering whether the generated code, which is loaded into GDDR (device memory) when the module is loaded, could cause the discrepancy. How could I verify this?

If I run with nvprof --track-memory-allocations on --print-gpu-trace and sum up the Size column, the total does not match what is being allocated. I used cudaMalloc to allocate and do not use unified memory.

I followed @Robert_Crovella's suggestion and used the SQLite export to look at the memory allocations, and the total was definitely off.
I also use CDP1 (CUDA Dynamic Parallelism v1) and wonder whether that has any impact.


Summing up the bytes gives 30 MB, but nvidia-smi shows 2 GB.

CUPTI, if I recall correctly, only shows user allocations. Additional allocations include, but are not limited to:

  1. device stack
  2. instruction RAM
  3. device malloc heap
  4. printf FIFO

The device stack can easily be in the 100-200 MiB range on a larger chip. If you have a kernel that uses excessive stack memory (e.g., 10 KiB per thread), then this allocation may exceed 1 GiB.

The device stack allocation is controllable by the developer through (a) careful design of kernels or (b) calling cudaDeviceSetLimit(cudaLimitStackSize, …). By default the local memory allocation will grow to meet kernel requirements, so a single kernel requiring a huge stack will cause the allocation to persist for the lifetime of the context/device unless the stack size is reduced by calling cudaDeviceSetLimit (or cuCtxSetLimit).
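For reference, a minimal sketch of querying and then shrinking the per-thread stack with the standard runtime API; the 4 KiB value is only illustrative, not a recommendation:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    std::printf("current per-thread stack size: %zu bytes\n", stackSize);

    // Reduce the per-thread stack after the large-stack kernel has finished,
    // so the backing local-memory allocation can shrink.
    cudaDeviceSetLimit(cudaLimitStackSize, 4 * 1024);

    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    std::printf("new per-thread stack size:     %zu bytes\n", stackSize);
    return 0;
}
```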

The instruction RAM usage depends on the kernels linked into the application. CUDA-accelerated libraries can use a lot of instruction RAM. CUDA 11.7 introduced lazy loading, which should reduce this size.
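If you want to confirm from inside the process whether lazy loading is actually in effect, a small sketch using the driver API (this assumes a CUDA 11.7+ toolkit/driver and linking with -lcuda; lazy loading is normally enabled by setting CUDA_MODULE_LOADING=LAZY in the environment before the process starts):

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUmoduleLoadingMode mode;
    if (cuModuleGetLoadingMode(&mode) == CUDA_SUCCESS) {
        std::printf("module loading mode: %s\n",
                    mode == CU_MODULE_LAZY_LOADING ? "LAZY" : "EAGER");
    }
    return 0;
}
```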

The device malloc heap can be configured by calling cudaDeviceSetLimit(cudaLimitMallocHeapSize, …).

The printf FIFO size can be configured by calling cudaDeviceSetLimit(cudaLimitPrintfFifoSize, …).

There are some additional limits that can be controlled.
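As a rough way to see where you stand, here is a sketch that prints the stack, malloc-heap, and printf-FIFO limits mentioned above and then shrinks the latter two; the specific values are arbitrary examples, not recommendations:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

static void show(const char* name, cudaLimit limit) {
    size_t value = 0;
    if (cudaDeviceGetLimit(&value, limit) == cudaSuccess)
        std::printf("%-26s %zu bytes\n", name, value);
}

int main() {
    show("cudaLimitStackSize",      cudaLimitStackSize);
    show("cudaLimitMallocHeapSize", cudaLimitMallocHeapSize);
    show("cudaLimitPrintfFifoSize", cudaLimitPrintfFifoSize);

    // Example only: shrink the device malloc heap and printf FIFO if the
    // application never uses device-side malloc or printf.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1 << 20);   // 1 MiB
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64 << 10);  // 64 KiB
    return 0;
}
```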

One method to help debug would be to insert a breakpoint before the CUDA runtime/driver is initialized and check how much memory is already allocated. Most of the buffers listed above are lazily created on, or just before, the first kernel launch that would use the allocation.
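A complementary in-process check is to sample free memory with cudaMemGetInfo around the first kernel launch; the delta approximates the lazily created buffers. A sketch, where the noop kernel is just a stand-in for your own first launch:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void noop() {}

static size_t freeMem() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);   // first call also creates the context
    return freeB;
}

int main() {
    size_t afterContext = freeMem();   // context created, modules loaded
    noop<<<1, 1>>>();
    cudaDeviceSynchronize();
    size_t afterLaunch = freeMem();    // stack/heap/FIFO buffers now exist

    std::printf("consumed by first launch: %zu bytes\n",
                afterContext - afterLaunch);
    return 0;
}
```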


Yes, it does seem like CDP and the instruction code cause this. I removed some templates and the usage does go down.
I have 4 devices and 4 processes, and I explicitly assign each process to one device. However, the instruction and device stack allocations seem to be broadcast to every device irrespective of which device the process is actually running on; nvidia-smi shows all 4 processes on each of the GPUs. Is there a way to restrict that?

CDP v1 with cudaDeviceSynchronize to wait on children can result in large memory allocations. See cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, …).
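A minimal sketch of capping the synchronization depth before the first CDP launch, assuming a CDP1-capable setup where device-side cudaDeviceSynchronize is still used; the depth of 2 is only an example:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Must be set before the first CDP launch; syncs deeper than the limit
    // will then fail, but the runtime reserves less memory for parent state.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 2);
    std::printf("set sync depth: %s\n", cudaGetErrorString(err));

    size_t depth = 0;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);
    std::printf("current sync depth limit: %zu\n", depth);
    return 0;
}
```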

I cannot provide additional information on the nvidia-smi output. This might be a better question for the System Management and Monitoring (NVML) forum or the primary CUDA forum.


Never mind: there was a dummy kernel running at startup that loaded the entire codebase into device memory for some reason.
