Does mixed precision reduce runtime memory size?

Hi everyone, I’m using NVIDIA’s yolov3_onnx Python sample and I’ve tried setting all the necessary FP16 parameters:

with trt.Builder(self._TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, self._TRT_LOGGER) as parser:
    builder.max_workspace_size = 1 << 30  # 1 GB
    builder.max_batch_size = 1
    builder.fp16_mode = True
    builder.strict_type_constraints = True
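
In case it helps, the surrounding build step looks roughly like this (a minimal sketch of the TensorRT 5.x flow; build_fp16_engine is just a hypothetical helper name):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_fp16_engine(onnx_path):
    # Build an FP16 engine from an ONNX file (TensorRT 5.x builder-flag API)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 30  # 1 GB
        builder.max_batch_size = 1
        builder.fp16_mode = True
        builder.strict_type_constraints = True
        with open(onnx_path, 'rb') as model:
            if not parser.parse(model.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        return builder.build_cuda_engine(network)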

I’ve even set each layer to the desired data type:

def _show_network(self, network):
    # Force every layer and each of its outputs to FP16
    for index in range(network.num_layers):
        layer = network.get_layer(index)
        layer.precision = trt.float16
        for idx in range(layer.num_outputs):
            layer.set_output_type(idx, trt.float16)
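
To double-check that those per-layer settings actually stick, something like this should print what each layer ends up with (a sketch; _dump_layer_types is just a hypothetical helper using the ILayer precision_is_set / output_type_is_set accessors):

def _dump_layer_types(self, network):
    # Print the precision and output types recorded on each layer
    for index in range(network.num_layers):
        layer = network.get_layer(index)
        prec = layer.precision if layer.precision_is_set else "unset"
        outs = [str(layer.get_output_type(i)) if layer.output_type_is_set(i) else "unset"
                for i in range(layer.num_outputs)]
        print("{}: precision={}, outputs={}".format(layer.name, prec, outs))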

I’m getting the intended inference speed-up, but what I’m curious about is the runtime GPU memory usage. With the FP16 settings unset, nvidia-smi reports around 785 MB. I was surprised to see that even after setting all the FP16 options, nvidia-smi still shows 785 MB. Is this what I should be seeing? All along I thought TensorRT with FP16 would also reduce the network’s GPU memory usage.
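
As a side question: would the engine’s own device_memory_size be a fairer thing to compare than the per-process total from nvidia-smi? Something like this (a sketch, assuming a serialized engine file; the path is whatever the engine was saved to):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def report_engine_memory(engine_path):
    # Deserialize a saved engine and print the device memory TensorRT
    # says it needs for activations at execution time
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
        print("device_memory_size: {} bytes".format(engine.device_memory_size))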

FYI, here are some specs from my system:
X-server (Ubuntu 18.04)
P100 GPU (Driver: 410.104)
CUDA 10.0
TensorRT 5.1.2
YOLOv3

Let me know if you need more information (although I can’t share the models, since they’re confidential and belong to a client). Thanks in advance!

@vincentj

Did you find the answer to this?