The output reports “ERROR from element primary-nvinference-engine: Infer operation failed”.
The detailed output is:
Now playing: /root/DeepStream_Release/samples/streams/sample_720p.h264
>>> Generating new TRT model engine
Using FP32 data type.
***** Storing serialized engine file as /root/DeepStream_Release/sources/apps/sample_apps/deepstream-test1/../../../../samples/models/Primary_Detector/resnet10.caffemodel_b1_fp32.engine batchsize = 1 *****
Running...
Frame Number = 0 Number of objects = 0 Vehicle Count = 0 Person Count = 0
cuda/cudaFusedConvActLayer.cpp (287) - Cuda Error in executeFused: 48
cuda/cudaFusedConvActLayer.cpp (287) - Cuda Error in executeFused: 48
Enqueue failed during inference
ERROR from element primary-nvinference-engine: Infer operation failed
Error details: gstnvinfer.c(781): gst_nvinfer_inference_thread (): /GstPipeline:dstest1-pipeline/GstNvInfer:primary-nvinference-engine
Returned, stopping playback
Frame Number = 1 Number of objects = 0 Vehicle Count = 0 Person Count = 0
cuda/cudaFusedConvActLayer.cpp (287) - Cuda Error in executeFused: 48
cuda/cudaFusedConvActLayer.cpp (287) - Cuda Error in executeFused: 48
Enqueue failed during inference
Deleting pipeline
By the way, I replaced nveglglessink with fakesink, and changed network-mode to 0.
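For reference, the numeric code in the log can be decoded. In the legacy (pre-CUDA-10.1) `cudaError_t` numbering that CUDA 10.0 uses, code 48 corresponds, as far as I know, to `cudaErrorNoKernelImageForDevice`. A small sketch (the `cuda_err_name` helper and the subset of codes it covers are illustrative, not from this thread):

```shell
# Sketch: map a numeric CUDA runtime error code from the log to its name.
# cuda_err_name is a hypothetical helper; the codes listed are from the
# legacy (pre-CUDA-10.1) cudaError_t enum used by CUDA 10.0.
cuda_err_name() {
  case "$1" in
    2)  echo "cudaErrorMemoryAllocation" ;;
    4)  echo "cudaErrorLaunchFailure" ;;
    48) echo "cudaErrorNoKernelImageForDevice" ;;
    *)  echo "unknown ($1)" ;;
  esac
}

cuda_err_name 48   # the code reported by gst-nvinfer above
```

`cudaErrorNoKernelImageForDevice` usually indicates a binary built for a different GPU architecture than the one executing it, which would fit a host/container CUDA mismatch.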
Hi,
Could you try running the command below? Change the paths to match your setup:
~/work/cpxavier/TensorRT-5.0.2/usr/src/tensorrt/bin/trtexec --deploy=/home/tse/work/dssource/DeepStreamSDK/Model/IVAPrimary_resnet10_DeepstreamRel_V2_ivalarge_its_phase1/resnet10.prototxt --output=conv2d_bbox --output=conv2d_cov --batch=2 --device=1 --int8
My system has two NVIDIA GPU cards; here I use GPU ID 1, which is a P4 card. Adjust the device index for your setup, and could you report back with the result? Thanks.
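The same check, parameterized so the paths are easy to swap in (the variable names and the paths below are examples, not your actual install locations):

```shell
# Build the suggested trtexec command from variables; edit TRT_BIN and
# PROTOTXT to point at your own TensorRT install and model directory.
TRT_BIN=/usr/src/tensorrt/bin/trtexec   # example path, adjust as needed
PROTOTXT=./resnet10.prototxt
DEVICE=1   # GPU index to test; use 0 on a single-GPU machine

cmd="$TRT_BIN --deploy=$PROTOTXT --output=conv2d_bbox --output=conv2d_cov --batch=2 --device=$DEVICE --int8"
echo "$cmd"   # inspect first, then run with: eval "$cmd"
```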
root@cb9d71d14ade:~/DeepStream_Release/samples/models/Primary_Detector# /usr/local/TensorRT-5.0.2.6/bin/trtexec --deploy=resnet10.prototxt --output=conv2d_bbox --output=conv2d_cov --batch=2 --device=0
deploy: resnet10.prototxt
output: conv2d_bbox
output: conv2d_cov
batch: 2
device: 0
Input "input_1": 3x368x640
Output "conv2d_bbox": 16x23x40
Output "conv2d_cov": 4x23x40
name=input_1, bindingIndex=0, buffers.size()=3
name=conv2d_bbox, bindingIndex=1, buffers.size()=3
name=conv2d_cov, bindingIndex=2, buffers.size()=3
Average over 10 runs is 9.6178 ms (host walltime is 9.81203 ms, 99% percentile time is 9.64509).
Average over 10 runs is 9.61194 ms (host walltime is 9.81955 ms, 99% percentile time is 9.63098).
Average over 10 runs is 9.60836 ms (host walltime is 9.80102 ms, 99% percentile time is 9.64755).
Average over 10 runs is 9.60653 ms (host walltime is 9.79853 ms, 99% percentile time is 9.6416).
Average over 10 runs is 9.60748 ms (host walltime is 9.80463 ms, 99% percentile time is 9.62582).
Average over 10 runs is 9.61442 ms (host walltime is 9.81987 ms, 99% percentile time is 9.63053).
Average over 10 runs is 9.61797 ms (host walltime is 9.82143 ms, 99% percentile time is 9.65859).
Average over 10 runs is 9.62259 ms (host walltime is 9.8242 ms, 99% percentile time is 9.65754).
Average over 10 runs is 9.62715 ms (host walltime is 9.83242 ms, 99% percentile time is 9.66416).
Average over 10 runs is 9.61373 ms (host walltime is 9.81387 ms, 99% percentile time is 9.63581).
With INT8:
root@cb9d71d14ade:~/DeepStream_Release/samples/models/Primary_Detector# /usr/local/TensorRT-5.0.2.6/bin/trtexec --deploy=resnet10.prototxt --output=conv2d_bbox --output=conv2d_cov --batch=2 --device=0 --int8
deploy: resnet10.prototxt
output: conv2d_bbox
output: conv2d_cov
batch: 2
device: 0
int8
Input "input_1": 3x368x640
Output "conv2d_bbox": 16x23x40
Output "conv2d_cov": 4x23x40
Int8 support requested on hardware without native Int8 support, performance will be negatively affected.
name=input_1, bindingIndex=0, buffers.size()=3
name=conv2d_bbox, bindingIndex=1, buffers.size()=3
name=conv2d_cov, bindingIndex=2, buffers.size()=3
Average over 10 runs is 9.61389 ms (host walltime is 9.73778 ms, 99% percentile time is 9.63885).
Average over 10 runs is 9.62688 ms (host walltime is 9.75592 ms, 99% percentile time is 9.64278).
Average over 10 runs is 9.62987 ms (host walltime is 9.75898 ms, 99% percentile time is 9.65258).
Average over 10 runs is 9.63567 ms (host walltime is 9.76663 ms, 99% percentile time is 9.65658).
Average over 10 runs is 9.62528 ms (host walltime is 9.75364 ms, 99% percentile time is 9.6543).
Average over 10 runs is 9.6302 ms (host walltime is 9.75916 ms, 99% percentile time is 9.65443).
Average over 10 runs is 9.62786 ms (host walltime is 9.75631 ms, 99% percentile time is 9.6544).
Average over 10 runs is 9.62383 ms (host walltime is 9.75005 ms, 99% percentile time is 9.64026).
Average over 10 runs is 9.62651 ms (host walltime is 9.7801 ms, 99% percentile time is 9.63904).
Average over 10 runs is 9.61735 ms (host walltime is 9.78592 ms, 99% percentile time is 9.63341).
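Averaging the per-run numbers above shows INT8 mode brings no speedup here, which is consistent with the warning that this GPU lacks native INT8 support. A quick convenience check (the `mean` helper is just an awk one-liner, not part of trtexec):

```shell
# Per-run averages copied from the trtexec output above (ms).
fp32="9.6178 9.61194 9.60836 9.60653 9.60748 9.61442 9.61797 9.62259 9.62715 9.61373"
int8="9.61389 9.62688 9.62987 9.63567 9.62528 9.6302 9.62786 9.62383 9.62651 9.61735"

# Compute the mean of a space-separated list of numbers.
mean() { echo "$1" | tr ' ' '\n' | awk '{s+=$1; n++} END {printf "%.3f\n", s/n}'; }

echo "fp32 mean: $(mean "$fp32") ms"   # prints: fp32 mean: 9.615 ms
echo "int8 mean: $(mean "$int8") ms"   # prints: int8 mean: 9.626 ms
```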
By the way, the host machine runs CentOS 7.4.1708. CUDA 10.0 is mapped from a host directory into the container when running nvidia-docker. The full container start command is:
nvidia-docker run -it -w /root -v /usr/local/cuda-10.0:/usr/local/cuda deepstream_docker_image bash
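Given the error above, one quick sanity check is whether the `-v` mount actually exposed a usable toolkit inside the container. A sketch (`check_cuda_mount` is a hypothetical helper, not part of DeepStream or the CUDA toolkit):

```shell
# Verify that a mounted CUDA directory contains a runnable nvcc and report
# its release; returns non-zero if the mount is missing or incomplete.
check_cuda_mount() {
  local cuda_dir="$1"
  if [ -d "$cuda_dir" ] && [ -x "$cuda_dir/bin/nvcc" ]; then
    "$cuda_dir/bin/nvcc" --version | grep -o 'release [0-9.]*'
  else
    echo "CUDA not found at $cuda_dir" >&2
    return 1
  fi
}

# Inside the container, the docker command above maps CUDA to /usr/local/cuda:
check_cuda_mount /usr/local/cuda || echo "CUDA mount missing; re-check the -v option"
```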