Provide details on the platforms you are using:
Linux distro and version: Ubuntu 16.04.5 LTS (GNU/Linux 4.4.0-133-generic x86_64)
GPU type: Tesla P4
nvidia driver version: 384.81
CUDA version: 9.0
CUDNN version: 7.4
Python version [if using python]: python 3.5
Tensorflow version: 1.10
TensorRT version: TensorRT 5.0.0 RC / Container image 18.10-py3
If Jetson, OS, hw versions: n/a
Describe the problem:
I am using the sample script inference.py to run TF-TRT 5 inference with different models. All models work except 'vgg_16' and 'vgg_19', which throw out-of-memory errors and fail while building the TensorRT INT8 engine. See below:
VGG_16:
2018-11-14 23:22:01.413486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:3b:00.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2018-11-14 23:22:02.162996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:5e:00.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2018-11-14 23:22:02.890455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:d8:00.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2018-11-14 23:22:02.897646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2
2018-11-14 23:22:04.324507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-14 23:22:04.324564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2
2018-11-14 23:22:04.324571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y
2018-11-14 23:22:04.324593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y
2018-11-14 23:22:04.324598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N
2018-11-14 23:22:04.325264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7029 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:3b:00.0, compute capability: 6.1)
2018-11-14 23:22:04.326184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7029 MB memory) -> physical GPU (device: 1, name: Tesla P4, pci bus id: 0000:5e:00.0, compute capability: 6.1)
2018-11-14 23:22:04.326857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 7029 MB memory) -> physical GPU (device: 2, name: Tesla P4, pci bus id: 0000:d8:00.0, compute capability: 6.1)
Using checkpoint found at: /home/dell/inference_trt5/pretrained_models/vgg_16/vgg_16.ckpt
2018-11-14 23:22:09.804605: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 3
2018-11-14 23:22:15.170533: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2936] Segment @scope 'vgg_16/', converted to graph
2018-11-14 23:22:15.354693: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:724] Can't determine the device, constructing an allocator at device 0
Cuda error in file src/implicit_gemm.cu at line 585: out of memory
2018-11-14 23:22:39.054257: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cuda/cudaFusedConvActLayer.cpp (277) - Cuda Error in executeFused: 2
2018-11-14 23:22:39.074433: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cuda/cudaFusedConvActLayer.cpp (277) - Cuda Error in executeFused: 2
2018-11-14 23:22:39.102904: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:857] Engine creation for segment 0, composed of 89 nodes failed: Internal: Failed to build TensorRT engine. Skipping...
Calibrating INT8...
VGG_19:
2018-11-14 23:23:58.162531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:3b:00.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2018-11-14 23:23:58.859691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:5e:00.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2018-11-14 23:23:59.612628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:d8:00.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2018-11-14 23:23:59.614979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2
2018-11-14 23:24:00.973007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-14 23:24:00.973068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 1 2
2018-11-14 23:24:00.973076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N Y Y
2018-11-14 23:24:00.973098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1: Y N Y
2018-11-14 23:24:00.973104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2: Y Y N
2018-11-14 23:24:00.973720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7029 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:3b:00.0, compute capability: 6.1)
2018-11-14 23:24:00.974696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7029 MB memory) -> physical GPU (device: 1, name: Tesla P4, pci bus id: 0000:5e:00.0, compute capability: 6.1)
2018-11-14 23:24:00.976583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 7029 MB memory) -> physical GPU (device: 2, name: Tesla P4, pci bus id: 0000:d8:00.0, compute capability: 6.1)
Using checkpoint found at: /home/dell/inference_trt5/pretrained_models/vgg_19/vgg_19.ckpt
2018-11-14 23:24:06.581198: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 3
2018-11-14 23:24:12.381253: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2936] Segment @scope 'vgg_19/', converted to graph
2018-11-14 23:24:12.658102: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:724] Can't determine the device, constructing an allocator at device 0
Cuda error in file src/implicit_gemm.cu at line 585: out of memory
2018-11-14 23:24:38.190669: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cuda/cudaFusedConvActLayer.cpp (277) - Cuda Error in executeFused: 2
2018-11-14 23:24:38.207207: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cuda/cudaFusedConvActLayer.cpp (277) - Cuda Error in executeFused: 2
2018-11-14 23:24:38.231940: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:857] Engine creation for segment 0, composed of 104 nodes failed: Internal: Failed to build TensorRT engine. Skipping...
Calibrating INT8...
Command lines to reproduce the test case:
vgg_16
python3 inference.py --model vgg_16 --precision int8 --use_trt --cache --batch_size 1
vgg_19
python3 inference.py --model vgg_19 --precision int8 --use_trt --cache --batch_size 1
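Since the logs show TensorFlow creating devices on all three P4s, one variation worth noting is restricting the run to a single GPU via `CUDA_VISIBLE_DEVICES` (standard CUDA behavior, not something specific to inference.py), in case the multi-GPU allocation is related:

```shell
# Make only GPU 0 visible to TensorFlow/TensorRT for this run
CUDA_VISIBLE_DEVICES=0 python3 inference.py --model vgg_16 --precision int8 --use_trt --cache --batch_size 1
```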
Any tips to make this work?
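My understanding is that the TF-TRT engine builder runs in the same process as TensorFlow, so both compete for the same device memory, and capping `max_workspace_size_bytes` might leave enough room for the builder. Below is a minimal sketch of that idea; the `trt_conversion_kwargs` helper is hypothetical (it is not part of inference.py), and only the `create_inference_graph` call named in the comment is the real TF 1.10 contrib API:

```python
# Hypothetical helper (not from inference.py): build conservative kwargs for the
# TF-TRT converter so the TensorRT engine builder leaves headroom for
# TensorFlow's own allocator on a 7.4 GiB Tesla P4.

def trt_conversion_kwargs(free_mem_bytes, max_batch_size=1, precision_mode="INT8"):
    """Return kwargs for trt.create_inference_graph with a capped workspace."""
    # Hand TensorRT only a quarter of the free memory; the engine is built in
    # the same process, so the remainder stays available to TensorFlow.
    return {
        "max_batch_size": max_batch_size,
        "max_workspace_size_bytes": int(free_mem_bytes * 0.25),
        "precision_mode": precision_mode,
    }

kwargs = trt_conversion_kwargs(free_mem_bytes=7 << 30)  # roughly the P4's free memory
print(kwargs["max_workspace_size_bytes"])

# Against the real TF 1.10 contrib API this would be passed as, e.g.:
#   from tensorflow.contrib import tensorrt as trt
#   trt_graph = trt.create_inference_graph(frozen_graph, output_names, **kwargs)
```

The 0.25 fraction is a guess on my part, not a documented recommendation; the point is only that the workspace cap is the knob that trades engine-build memory against TensorFlow's allocator.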