I am using TensorRT instead of Caffe to accelerate my application, and I found that memory usage grows by about 400 MB when I parse two models. I want to run 4 processes on a TX2, but right now I can only run 2 because TensorRT occupies too much memory. I have not created many buffers, yet memory usage climbs quickly even when I only construct a TensorRT object.
There are two sources of memory consumption:
1. Loading libraries (TensorRT, cuDNN, cuBLAS…):
- Amount: around 600 MiB (TensorRT 3)
- Required, but shared among all processes.
2. Building the inference engine:
- Amount: depends on the network size
- Can be limited with setMaxWorkspaceSize() and setMaxBatchSize(). Each process has its own consumption.
Before I start any process, memory usage is about 2388 / 7851 MB.
When I start one process, memory usage is about 3592 / 7851 MB.
When I start a second process, memory usage rises to about 6000 / 7851 MB. I don't know why the second process takes twice as much memory as the first when using TensorRT. When I use only Caffe, the second process uses the same amount of memory as the first.
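To make the comparison concrete, the readings above can be turned into per-process increases. The following is a minimal C++ sketch. The helper name memoryDeltasMB is hypothetical, and the only inputs assumed are the three readings quoted in this post.

```cpp
#include <cstddef>
#include <vector>

// Given successive total-memory readings (in MB), return how much each
// step added over the previous reading.
// memoryDeltasMB is a hypothetical helper, not part of TensorRT or CUDA.
std::vector<long> memoryDeltasMB(const std::vector<long>& readingsMB) {
    std::vector<long> deltas;
    for (std::size_t i = 1; i < readingsMB.size(); ++i) {
        deltas.push_back(readingsMB[i] - readingsMB[i - 1]);
    }
    return deltas;
}
```

Calling memoryDeltasMB({2388, 3592, 6000}) gives 1204 MB for the first process and 2408 MB for the second, which is the doubling described above.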
Was the increase from 3592 / 7851 MB to 6000 / 7851 MB measured with TensorRT or with Caffe?
Could you share the detailed memory usage for both the TensorRT and the Caffe case?
Also, please share some information about your model.
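One possible way to collect comparable numbers for both cases is to log system memory from inside each process before and after the engine is created. The sketch below assumes a Linux system where /proc/meminfo is available. On the TX2 the CPU and GPU share the same physical memory, so system-level figures should also reflect GPU allocations. The function names are hypothetical.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Parse one field (e.g. "MemAvailable") from /proc/meminfo-style text.
// Returns the value in kB, or -1 if the field is not present.
// Hypothetical helper for illustration only.
long readMeminfoFieldKB(std::istream& in, const std::string& field) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        std::string key;
        long value = 0;
        // Each line looks like "MemAvailable:    4567 kB".
        if ((ls >> key >> value) && key == field + ":") {
            return value;
        }
    }
    return -1;
}

// Convenience wrapper that reads the live /proc/meminfo.
// Returns -1 if the file cannot be opened or the field is missing.
long readSystemMeminfoKB(const std::string& field) {
    std::ifstream in("/proc/meminfo");
    return readMeminfoFieldKB(in, field);
}
```

Logging readSystemMeminfoKB("MemAvailable") right before and right after building the engine, once for TensorRT and once for Caffe, would show how much each step costs per process.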
I use the three MTCNN models: P-Net, R-Net and O-Net. The models are in the attachment.
I use TensorRT for R-Net and O-Net. For these two networks I wrote a PReLU plugin layer.
I use Caffe for P-Net, because TensorRT does not support dynamic input resolution.
I have also tested with only R-Net and O-Net, and the same behavior appears.
model.rar (2.08 MB)
I have tested it again with R-Net and O-Net. This time I retrained the models with ReLU instead of PReLU, so I did not have to write a plugin layer myself. With this change, memory rises by about 600 MB when I start the first process, and the second process also adds about 600 MB. So I think the issue may be in the plugin, and I have attached the plugin source.
trtplugin.h (9.1 KB)
From your description, the abnormal memory is most likely allocated by the plugin implementation.
A quick check of your source shows an allocation call for the PReLU parameters:
CHECK(cudaMalloc(&deviceData, count * sizeof(float)));
But a PReLU layer should have only a few weights.
Could you check how much memory is actually allocated there and tell us the value?
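As a rough way to check this, the requested size can be computed and printed right before the cudaMalloc call. This is only a sketch: preluParamBytes and logPreluAllocation are hypothetical helpers, and the layer name passed in is whatever label you choose for your own plugin. The size formula is the same count * sizeof(float) used in the call above.

```cpp
#include <cstddef>
#include <cstdio>

// Bytes requested by cudaMalloc(&deviceData, count * sizeof(float)).
// Hypothetical helper, shown only to make the size explicit.
std::size_t preluParamBytes(std::size_t count) {
    return count * sizeof(float);
}

// Print the size so it can be compared against the observed memory growth.
// layerName is an arbitrary label chosen by the caller.
void logPreluAllocation(const char* layerName, std::size_t count) {
    std::printf("%s: count=%zu, bytes=%zu\n",
                layerName, count, preluParamBytes(count));
}
```

If the printed byte counts are small (a PReLU layer usually has at most one slope per channel), then this allocation alone cannot account for several hundred MB, and the extra memory has to come from somewhere else in the plugin or the engine.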
As you mentioned, the libraries are shared among all processes. But in my experiments, memory usage increases linearly with the number of processes (about 800 MB per process). Do we need to change any setting to make sure the libraries are shared among all processes on the same GPU? Thanks!
Please open a new topic for your issue.