Difference in memory usage across GPU models during TensorFlow C++ inference

Hello everyone

I train and freeze a TensorFlow graph in Python and run inference with the TensorFlow C++ API on Windows 10.
During testing, I noticed that GPU memory usage differs by GPU model.

I tested 5 GPUs: 1060 (6 GB), 1080 Ti, 1660 Ti, 2070, and 2080 Ti.
The test method is simple.
First, I mount a single GPU and install the driver.
Then I run my C++ inference code.
I use the same code and model (frozen graph .pb file) on every GPU.
Finally, I check the difference in GPU memory usage before and after running the C++ inference code.

I use the TensorFlow GPU option allow_growth=true.
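For reference, my inference code sets this through tensorflow::SessionOptions, roughly like this (a simplified sketch, not my full code; MakeSession is just an illustrative wrapper):

```cpp
#include <memory>

#include "tensorflow/core/public/session.h"

std::unique_ptr<tensorflow::Session> MakeSession() {
  // Enable allow_growth on the GPU options of the session config.
  tensorflow::SessionOptions options;
  options.config.mutable_gpu_options()->set_allow_growth(true);
  return std::unique_ptr<tensorflow::Session>(tensorflow::NewSession(options));
}
```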

I ran the test a few times on each GPU.
The average memory usage of TensorFlow is as follows:

1060 : 397 MB
1080Ti : 481 MB
1660Ti : 621 MB
2070 : 644 MB
2080Ti : 712 MB

This happens even with a simple 2-layer fully connected graph for the MNIST example.

Why does this differ?
Is it caused by the NVIDIA configuration or by TensorFlow?

I want memory usage to stay small regardless of the GPU model.
Please help me.

There are several factors contributing to the overall TF memory footprint, including:

  1. The CUDA driver creates per-core contexts in device memory with space for things like stack memory and thread-local storage. This overhead will increase linearly with the number of CUDA cores on the GPU. (This is likely the dominant reason the 1080Ti has a larger memory footprint than the 1060.)
  2. cuDNN and cuBLAS provide optimized kernels for convolution and matrix multiplication routines. Some of these require additional workspace allocations and are only available on some architectures (for example, Turing provides Tensor Core kernels, Pascal does not).
  3. The TensorFlow allocator grabs chunks of memory at a time, so falling slightly over a memory limit may result in a larger increase in allocated space than expected.

In general, allow_growth is not a good way to limit overall device memory consumption because it does not provide any information about what footprint is acceptable. Without that info, the framework will optimize for speed (by choosing fast but memory-hungry algorithms). A better option is to set the per_process_gpu_memory_fraction config option.
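For example, in the C++ API (which you are using for inference) the fraction is set on the same GPU options; a minimal sketch, with 0.3 as a purely illustrative value:

```cpp
#include <memory>

#include "tensorflow/core/public/session.h"

std::unique_ptr<tensorflow::Session> MakeSession() {
  tensorflow::SessionOptions options;
  // Cap TensorFlow at ~30% of device memory (illustrative value; tune it to
  // the smallest fraction that still fits your model and its workspaces).
  options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.3);
  return std::unique_ptr<tensorflow::Session>(tensorflow::NewSession(options));
}
```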

Thank you for the detailed explanation.

And I have an additional question.
Regarding point 2), can I disable the optimized kernels for the convolution and matrix multiplication routines in cuDNN and cuBLAS?
I ask because I use a Turing-architecture GPU.

Is there any way to configure cuDNN or cuBLAS so that the optimized kernels are not used?
I tried to find a way, but I couldn't.

TensorFlow will attempt to optimize for speed within an allowed memory footprint. You can reduce that memory footprint by setting the per_process_gpu_memory_fraction instead of allow_growth.
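Concretely, in a frozen-graph C++ setup it might look roughly like this (a sketch; "frozen_graph.pb" and the 0.3 fraction are placeholders):

```cpp
#include <memory>

#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/public/session.h"

int main() {
  // Limit the per-process GPU memory footprint (placeholder fraction).
  tensorflow::SessionOptions options;
  options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.3);

  std::unique_ptr<tensorflow::Session> session(tensorflow::NewSession(options));

  // Load the frozen graph ("frozen_graph.pb" is a placeholder path).
  tensorflow::GraphDef graph_def;
  tensorflow::Status status = tensorflow::ReadBinaryProto(
      tensorflow::Env::Default(), "frozen_graph.pb", &graph_def);
  if (!status.ok()) return 1;

  status = session->Create(graph_def);
  if (!status.ok()) return 1;

  // ... run inference with session->Run(...) as before ...
  return 0;
}
```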