I have recently been developing TensorRT based implementations of a Caffe model that I have access to. The different versions of my TensorRT model use either unified memory or avoid it. The model is a cascade type, and so is built from multiple TensorRT engines that feed into/out of each other.
I have been profiling my models as well as the original Caffe implementation and found that the TensorRT engine runs significantly slower from end-to-end (i.e., the whole cascade model) than the Caffe model. The profiling was performed only for the parts of the code responsible for inference (i.e., I did not profile the building of the TensorRT engines). Note, the individual inference runtime for the component parts of the cascade models have comparable runtimes.
Below I have included results from nvprof showing API calls that are identified by the profiler for Caffe, TensorRT with unified memory, and TensorRT without unified memory. I want to draw your attention to the TensorRT versions. Ignoring cudaLaunch, there seems to be an enormous (20ms) overhead API call cudaGetDeviceProperties (as well as numerous others at the top of their respective lists) that are not being made explicitly in my code. These calls are not present in the profiled Caffe model and seem to be responsible for the increase in runtime. As far as I can tell, they seem to be coming directly from nvinfer.
Does anyone know if/how I can stop these calls from taking place?
CAFFE VERSION
Type Time(%) Time Calls Avg Min Max Name
API calls: 92.18% 347.75ms 324 1.0733ms 3.9630us 345.85ms cudaLaunch
2.95% 11.124ms 104 106.96us 8.6530us 711.96us cudaMemcpy
2.36% 8.9223ms 116 76.916us 7.8400us 545.04us cudaMalloc
2.06% 7.7586ms 30 258.62us 11.226us 1.7240ms cudaMallocHost
0.21% 779.58us 16 48.723us 12.843us 197.55us cudaFreeHost
0.10% 370.07us 47 7.8730us 3.2760us 18.259us cudaMemset
0.06% 243.46us 18 13.525us 7.7740us 35.816us cudaFree
0.04% 140.85us 1202 117ns 85ns 2.0860us cudaSetupArgument
0.02% 69.378us 324 214ns 127ns 1.3840us cudaConfigureCall
0.01% 35.473us 263 134ns 84ns 628ns cudaGetLastError
0.01% 30.309us 51 594ns 399ns 1.2930us cudaStreamWaitEvent
0.01% 25.348us 30 844ns 499ns 2.2150us cudaEventRecord
0.00% 7.4600us 42 177ns 137ns 221ns cudaPeekAtLastError
0.00% 6.5850us 3 2.1950us 1.7810us 2.4950us cudaEventCreate
0.00% 4.7290us 3 1.5760us 1.4400us 1.7480us cudaEventDestroy
TensorRT VERSION Unified memory
Type Time(%) Time Calls Avg Min Max Name
API calls: 87.55% 462.38ms 100 4.6238ms 4.2470us 461.53ms cudaLaunch
3.88% 20.506ms 84 244.12us 227.45us 393.44us cudaGetDeviceProperties
2.20% 11.597ms 147 78.890us 887ns 1.7993ms cudaEventSynchronize
1.77% 9.3487ms 14 667.76us 611.48us 724.51us cudaHostAlloc
1.22% 6.4689ms 16 404.30us 6.2670us 1.8623ms cudaMallocPitch
0.92% 4.8744ms 14 348.17us 268.88us 848.98us cudaFreeHost
0.69% 3.6207ms 230 15.742us 475ns 726.43us cudaFree
0.65% 3.4084ms 172 19.816us 2.4920us 456.83us cudaMalloc
0.20% 1.0391ms 19 54.691us 10.307us 213.37us cudaMallocManaged
0.16% 857.14us 62 13.824us 4.5270us 60.571us cudaLaunchKernel
0.10% 532.60us 1 532.60us 532.60us 532.60us cudaMemcpy2D
0.08% 409.63us 179 2.2880us 1.2970us 7.5200us cudaEventRecord
0.08% 406.81us 760 535ns 377ns 3.2470us cudaEventCreateWithFlags
0.08% 400.77us 182 2.2020us 1.5590us 6.6240us cudaStreamDestroy
0.07% 382.46us 28 13.659us 6.4080us 111.90us cudaMemcpy
0.07% 349.96us 760 460ns 357ns 1.7750us cudaEventDestroy
0.04% 207.92us 147 1.4140us 794ns 3.7120us cudaEventElapsedTime
0.04% 204.72us 56 3.6550us 2.1630us 9.8110us cudaMemsetAsync
0.04% 198.60us 112 1.7730us 1.0630us 7.3200us cudaStreamCreateWithFlags
0.04% 196.05us 32 6.1260us 2.0560us 54.802us cudaDeviceSynchronize
0.03% 142.65us 532 268ns 220ns 1.0650us cudaDeviceGetAttribute
0.02% 109.41us 56 1.9530us 1.4710us 3.6540us cudaThreadSynchronize
0.02% 80.719us 1 80.719us 80.719us 80.719us cudaMemcpy2DAsync
0.01% 76.872us 636 120ns 87ns 1.4580us cudaSetupArgument
0.01% 75.974us 136 558ns 237ns 6.7500us cudaGetDevice
0.01% 64.272us 56 1.1470us 1.0580us 1.2740us cudaStreamCreateWithPriority
0.01% 62.600us 14 4.4710us 3.4300us 6.2920us cudaStreamCreate
0.01% 32.204us 100 322ns 135ns 2.0040us cudaConfigureCall
0.01% 27.397us 134 204ns 122ns 733ns cudaGetLastError
0.00% 26.347us 14 1.8810us 1.5780us 2.7180us cudaHostGetDevicePointer
0.00% 15.666us 9 1.7400us 1.5370us 2.3240us cudaStreamSynchronize
0.00% 9.4850us 14 677ns 579ns 892ns cudaDeviceGetStreamPriorityRange
0.00% 7.1140us 44 161ns 128ns 257ns cudaPeekAtLastError
0.00% 3.0450us 15 203ns 125ns 845ns cudaGetDeviceCount
TensorRT VERSION None unified memory
Type Time(%) Time Calls Avg Min Max Name
API calls: 86.87% 399.18ms 98 4.0732ms 4.8630us 398.03ms cudaLaunch
4.35% 19.997ms 84 238.06us 226.26us 401.58us cudaGetDeviceProperties
1.90% 8.7136ms 14 622.40us 563.19us 715.90us cudaHostAlloc
1.39% 6.3999ms 30 213.33us 25.459us 3.1952ms cudaMemcpyAsync
1.13% 5.2134ms 147 35.465us 814ns 1.5516ms cudaEventSynchronize
0.97% 4.4700ms 14 319.29us 255.07us 781.76us cudaFreeHost
0.87% 4.0073ms 188 21.315us 2.4620us 336.66us cudaMalloc
0.70% 3.2225ms 230 14.010us 488ns 667.09us cudaFree
0.43% 1.9654ms 16 122.84us 5.6400us 412.39us cudaMallocPitch
0.24% 1.1091ms 64 17.329us 4.7470us 59.771us cudaLaunchKernel
0.16% 733.81us 45 16.306us 5.9900us 219.10us cudaMemcpy
0.15% 675.91us 56 12.069us 986ns 387.94us cudaStreamCreateWithPriority
0.10% 454.88us 179 2.5410us 1.2720us 5.2580us cudaEventRecord
0.09% 433.25us 3 144.42us 24.083us 224.10us cudaMallocManaged
0.08% 387.90us 182 2.1310us 1.5020us 10.414us cudaStreamDestroy
0.08% 377.04us 760 496ns 374ns 3.1890us cudaEventCreateWithFlags
0.07% 335.52us 760 441ns 355ns 1.5320us cudaEventDestroy
0.06% 283.76us 112 2.5330us 1.0080us 33.390us cudaStreamCreateWithFlags
0.05% 248.10us 37 6.7050us 1.9500us 55.781us cudaDeviceSynchronize
0.04% 194.49us 44 4.4200us 989ns 27.287us cudaStreamCreate
0.04% 192.58us 147 1.3100us 774ns 4.2910us cudaEventElapsedTime
0.04% 181.30us 56 3.2370us 1.9040us 8.1200us cudaMemsetAsync
0.04% 179.08us 1 179.08us 179.08us 179.08us cudaMemcpy2D
0.03% 134.75us 532 253ns 211ns 4.5540us cudaDeviceGetAttribute
0.02% 103.94us 56 1.8560us 1.4250us 2.3840us cudaThreadSynchronize
0.02% 87.762us 632 138ns 87ns 413ns cudaSetupArgument
0.01% 58.684us 136 431ns 262ns 1.3970us cudaGetDevice
0.01% 32.237us 134 240ns 129ns 599ns cudaGetLastError
0.01% 30.686us 98 313ns 152ns 767ns cudaConfigureCall
0.01% 23.982us 14 1.7130us 1.5170us 2.5270us cudaHostGetDevicePointer
0.00% 21.219us 1 21.219us 21.219us 21.219us cudaMemcpy2DAsync
0.00% 17.485us 9 1.9420us 1.5110us 2.2680us cudaStreamSynchronize
0.00% 9.2380us 14 659ns 532ns 1.0440us cudaDeviceGetStreamPriorityRange
0.00% 8.5240us 44 193ns 128ns 288ns cudaPeekAtLastError
0.00% 3.0480us 15 203ns 140ns 315ns cudaGetDeviceCount