Global memory usage profiling and tracking

When I run nvidia-smi while a process is running, it shows more global memory in use than I allocated.
My compute-sanitizer runs are clean.
I would like to know if there is a way, while an application is running, to see what is being allocated, when, and when it is freed. I thought the nv profiler's driver API trace would show cudaMalloc, but it did not.
I ran with nvprof --track-memory-allocations on.
How would I find this information?
I am using a Tesla V-100, which only supports nvprof and not nsight-sys.
CUDA version is 12.0.

My GPU code uses quite a few templates for the CUDA kernels, and I am wondering whether the generated code, which is saved in GDDR during loading, could account for the discrepancy. How could I verify this?

If I run with nvprof --track-memory-allocations on --print-gpu-trace and sum up the Size column, the total doesn't match what nvidia-smi reports as allocated. I used cudaMalloc to allocate and do not use unified memory.

I followed @Robert_Crovella's suggestion and used SQLite to look at the memory allocations, and the total was definitely off.
Also, I use CDP1 and wonder if that has any impact.

Summing up the bytes gives about 30 MB, but nvidia-smi shows 2 GB.

CUPTI, if I recall correctly, only shows user allocations. Additional allocations include, but are not limited to:

  1. device stack
  2. instruction RAM
  3. device malloc heap
  4. printf FIFO

The device stack can easily be 100-200 MiB on a larger chip. If you have a kernel that uses an excessive amount of stack memory (say 10 KiB per thread), this allocation may exceed 1 GiB.

The device stack allocation is controllable by the developer through (a) careful design of kernels or (b) calling cudaDeviceSetLimit(cudaLimitStackSize, …). By default the local memory allocation grows to meet each kernel's requirements, so a single kernel requiring a huge stack will keep that allocation for the lifetime of the context/device unless the stack size is reduced by calling cudaDeviceSetLimit (or cuCtxSetLimit).
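A minimal sketch of querying and then shrinking the stack limit with the runtime API (the 1 KiB value is illustrative, not a recommendation; reduce it only after the stack-hungry kernel has finished):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stackSize = 0;
    // Query the current per-thread device stack size in bytes.
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("stack size before: %zu bytes/thread\n", stackSize);

    // Shrink the per-thread stack so the large local-memory
    // backing allocation can be released.
    cudaDeviceSetLimit(cudaLimitStackSize, 1024);
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("stack size after:  %zu bytes/thread\n", stackSize);
    return 0;
}
```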

The instruction RAM usage depends on the kernels linked into the application. CUDA-accelerated libraries can use a lot of instruction RAM. CUDA 11.7 introduced lazy loading, which should reduce this size.
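On CUDA 11.7+ lazy module loading can be enabled with an environment variable, so kernels are loaded into device memory on first use instead of all at once (`./my_app` is a placeholder for your binary):

```shell
# Defer loading of each kernel until its first launch.
CUDA_MODULE_LOADING=LAZY ./my_app
```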

The device malloc heap can be configured by calling cudaDeviceSetLimit(cudaLimitMallocHeapSize, …).

The printf FIFO size can be configured by calling cudaDeviceSetLimit(cudaLimitPrintfFifoSize, …).
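A sketch combining the two limits above (the sizes match the documented defaults of 8 MiB and 1 MiB and are shown only for illustration; set them before the first kernel launch):

```cpp
#include <cuda_runtime.h>

int main() {
    // Cap the in-kernel malloc() heap (default is 8 MiB).
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8u << 20);
    // Cap the device-side printf FIFO (default is 1 MiB).
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1u << 20);
    return 0;
}
```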

There are some additional limits that can be controlled.

One method to help debug would be to insert a breakpoint before the CUDA runtime/driver is initialized and see how much memory is already allocated. Most of the buffers listed above are lazily created on, or just before, the first kernel launch that would use them.


Yes, it does seem like CDP and instruction code cause this. I removed some templates and memory usage does seem lower.
I have 4 devices and 4 processes, and I explicitly assign each process to one device. However, the instruction and device-stack allocations seem to be broadcast to every device irrespective of which device the code actually runs on? nvidia-smi shows all 4 processes on each of the GPUs. Is there a way to restrict that?

CDP v1 with cudaDeviceSynchronize to wait on children can result in large memory allocations. See cudaDeviceSetLimit( cudaLimitDevRuntimeSyncDepth, …)
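A hedged sketch of capping that sync depth (the value 2 is illustrative; the limit must be set before launching the parent kernel, and applies to CDP v1 only):

```cpp
#include <cuda_runtime.h>

int main() {
    // Reserve backing store for at most 2 nesting levels of grids
    // that call cudaDeviceSynchronize() on their children (CDP v1).
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 2);
    return 0;
}
```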

I cannot provide additional information on the nvidia-smi output. This might be a better question for the System Management and Monitoring (NVML) - NVIDIA Developer Forums forum or the primary CUDA forum.


Nvm… there was a dummy kernel running at startup which, for some reason, loaded the entire codebase into device memory.
