Cascade TensorRT models: slow runtime

I have recently been developing TensorRT-based implementations of a Caffe model that I have access to. The versions of my TensorRT model differ in whether or not they use unified memory. The model is a cascade, built from multiple TensorRT engines that feed into and out of each other.

I have been profiling my models as well as the original Caffe implementation, and found that the TensorRT version runs significantly slower end to end (i.e., over the whole cascade) than the Caffe model. Profiling covered only the parts of the code responsible for inference (i.e., I did not profile the building of the TensorRT engines). Note that the individual component engines of the cascade have inference runtimes comparable to their Caffe counterparts.
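For anyone wanting to reproduce this kind of scoped profiling, one common approach (a general sketch, not necessarily how I did it) is to bracket the inference region in the application with `cudaProfilerStart()`/`cudaProfilerStop()` from `cuda_profiler_api.h`, and then launch nvprof with profiling initially disabled. The binary name below is just a placeholder:

```shell
# Profile only the region between cudaProfilerStart() and cudaProfilerStop();
# the application must call those functions around its inference code.
# "./trt_cascade" is a placeholder for the actual binary.
nvprof --profile-from-start off ./trt_cascade
```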

Below I have included nvprof results showing the API calls identified by the profiler for Caffe, TensorRT with unified memory, and TensorRT without unified memory. I want to draw your attention to the two TensorRT versions: ignoring cudaLaunch, cudaGetDeviceProperties alone accounts for roughly 20ms of overhead (84 calls), and there are numerous other calls near the top of their respective lists that are not made explicitly in my code. These calls are absent from the profiled Caffe model and appear to be responsible for the increase in runtime. As far as I can tell, they come directly from nvinfer.

Does anyone know if/how I can stop these calls from taking place?

Caffe VERSION
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   92.18%  347.75ms       324  1.0733ms  3.9630us  345.85ms  cudaLaunch
                    2.95%  11.124ms       104  106.96us  8.6530us  711.96us  cudaMemcpy
                    2.36%  8.9223ms       116  76.916us  7.8400us  545.04us  cudaMalloc
                    2.06%  7.7586ms        30  258.62us  11.226us  1.7240ms  cudaMallocHost
                    0.21%  779.58us        16  48.723us  12.843us  197.55us  cudaFreeHost
                    0.10%  370.07us        47  7.8730us  3.2760us  18.259us  cudaMemset
                    0.06%  243.46us        18  13.525us  7.7740us  35.816us  cudaFree
                    0.04%  140.85us      1202     117ns      85ns  2.0860us  cudaSetupArgument
                    0.02%  69.378us       324     214ns     127ns  1.3840us  cudaConfigureCall
                    0.01%  35.473us       263     134ns      84ns     628ns  cudaGetLastError
                    0.01%  30.309us        51     594ns     399ns  1.2930us  cudaStreamWaitEvent
                    0.01%  25.348us        30     844ns     499ns  2.2150us  cudaEventRecord
                    0.00%  7.4600us        42     177ns     137ns     221ns  cudaPeekAtLastError
                    0.00%  6.5850us         3  2.1950us  1.7810us  2.4950us  cudaEventCreate
                    0.00%  4.7290us         3  1.5760us  1.4400us  1.7480us  cudaEventDestroy

TensorRT VERSION Unified memory
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   87.55%  462.38ms       100  4.6238ms  4.2470us  461.53ms  cudaLaunch
                    3.88%  20.506ms        84  244.12us  227.45us  393.44us  cudaGetDeviceProperties
                    2.20%  11.597ms       147  78.890us     887ns  1.7993ms  cudaEventSynchronize
                    1.77%  9.3487ms        14  667.76us  611.48us  724.51us  cudaHostAlloc
                    1.22%  6.4689ms        16  404.30us  6.2670us  1.8623ms  cudaMallocPitch
                    0.92%  4.8744ms        14  348.17us  268.88us  848.98us  cudaFreeHost
                    0.69%  3.6207ms       230  15.742us     475ns  726.43us  cudaFree
                    0.65%  3.4084ms       172  19.816us  2.4920us  456.83us  cudaMalloc
                    0.20%  1.0391ms        19  54.691us  10.307us  213.37us  cudaMallocManaged
                    0.16%  857.14us        62  13.824us  4.5270us  60.571us  cudaLaunchKernel
                    0.10%  532.60us         1  532.60us  532.60us  532.60us  cudaMemcpy2D
                    0.08%  409.63us       179  2.2880us  1.2970us  7.5200us  cudaEventRecord
                    0.08%  406.81us       760     535ns     377ns  3.2470us  cudaEventCreateWithFlags
                    0.08%  400.77us       182  2.2020us  1.5590us  6.6240us  cudaStreamDestroy
                    0.07%  382.46us        28  13.659us  6.4080us  111.90us  cudaMemcpy
                    0.07%  349.96us       760     460ns     357ns  1.7750us  cudaEventDestroy
                    0.04%  207.92us       147  1.4140us     794ns  3.7120us  cudaEventElapsedTime
                    0.04%  204.72us        56  3.6550us  2.1630us  9.8110us  cudaMemsetAsync
                    0.04%  198.60us       112  1.7730us  1.0630us  7.3200us  cudaStreamCreateWithFlags
                    0.04%  196.05us        32  6.1260us  2.0560us  54.802us  cudaDeviceSynchronize
                    0.03%  142.65us       532     268ns     220ns  1.0650us  cudaDeviceGetAttribute
                    0.02%  109.41us        56  1.9530us  1.4710us  3.6540us  cudaThreadSynchronize
                    0.02%  80.719us         1  80.719us  80.719us  80.719us  cudaMemcpy2DAsync
                    0.01%  76.872us       636     120ns      87ns  1.4580us  cudaSetupArgument
                    0.01%  75.974us       136     558ns     237ns  6.7500us  cudaGetDevice
                    0.01%  64.272us        56  1.1470us  1.0580us  1.2740us  cudaStreamCreateWithPriority
                    0.01%  62.600us        14  4.4710us  3.4300us  6.2920us  cudaStreamCreate
                    0.01%  32.204us       100     322ns     135ns  2.0040us  cudaConfigureCall
                    0.01%  27.397us       134     204ns     122ns     733ns  cudaGetLastError
                    0.00%  26.347us        14  1.8810us  1.5780us  2.7180us  cudaHostGetDevicePointer
                    0.00%  15.666us         9  1.7400us  1.5370us  2.3240us  cudaStreamSynchronize
                    0.00%  9.4850us        14     677ns     579ns     892ns  cudaDeviceGetStreamPriorityRange
                    0.00%  7.1140us        44     161ns     128ns     257ns  cudaPeekAtLastError
                    0.00%  3.0450us        15     203ns     125ns     845ns  cudaGetDeviceCount

TensorRT VERSION Non-unified memory
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   86.87%  399.18ms        98  4.0732ms  4.8630us  398.03ms  cudaLaunch
                    4.35%  19.997ms        84  238.06us  226.26us  401.58us  cudaGetDeviceProperties
                    1.90%  8.7136ms        14  622.40us  563.19us  715.90us  cudaHostAlloc
                    1.39%  6.3999ms        30  213.33us  25.459us  3.1952ms  cudaMemcpyAsync
                    1.13%  5.2134ms       147  35.465us     814ns  1.5516ms  cudaEventSynchronize
                    0.97%  4.4700ms        14  319.29us  255.07us  781.76us  cudaFreeHost
                    0.87%  4.0073ms       188  21.315us  2.4620us  336.66us  cudaMalloc
                    0.70%  3.2225ms       230  14.010us     488ns  667.09us  cudaFree
                    0.43%  1.9654ms        16  122.84us  5.6400us  412.39us  cudaMallocPitch
                    0.24%  1.1091ms        64  17.329us  4.7470us  59.771us  cudaLaunchKernel
                    0.16%  733.81us        45  16.306us  5.9900us  219.10us  cudaMemcpy
                    0.15%  675.91us        56  12.069us     986ns  387.94us  cudaStreamCreateWithPriority
                    0.10%  454.88us       179  2.5410us  1.2720us  5.2580us  cudaEventRecord
                    0.09%  433.25us         3  144.42us  24.083us  224.10us  cudaMallocManaged
                    0.08%  387.90us       182  2.1310us  1.5020us  10.414us  cudaStreamDestroy
                    0.08%  377.04us       760     496ns     374ns  3.1890us  cudaEventCreateWithFlags
                    0.07%  335.52us       760     441ns     355ns  1.5320us  cudaEventDestroy
                    0.06%  283.76us       112  2.5330us  1.0080us  33.390us  cudaStreamCreateWithFlags
                    0.05%  248.10us        37  6.7050us  1.9500us  55.781us  cudaDeviceSynchronize
                    0.04%  194.49us        44  4.4200us     989ns  27.287us  cudaStreamCreate
                    0.04%  192.58us       147  1.3100us     774ns  4.2910us  cudaEventElapsedTime
                    0.04%  181.30us        56  3.2370us  1.9040us  8.1200us  cudaMemsetAsync
                    0.04%  179.08us         1  179.08us  179.08us  179.08us  cudaMemcpy2D
                    0.03%  134.75us       532     253ns     211ns  4.5540us  cudaDeviceGetAttribute
                    0.02%  103.94us        56  1.8560us  1.4250us  2.3840us  cudaThreadSynchronize
                    0.02%  87.762us       632     138ns      87ns     413ns  cudaSetupArgument
                    0.01%  58.684us       136     431ns     262ns  1.3970us  cudaGetDevice
                    0.01%  32.237us       134     240ns     129ns     599ns  cudaGetLastError
                    0.01%  30.686us        98     313ns     152ns     767ns  cudaConfigureCall
                    0.01%  23.982us        14  1.7130us  1.5170us  2.5270us  cudaHostGetDevicePointer
                    0.00%  21.219us         1  21.219us  21.219us  21.219us  cudaMemcpy2DAsync
                    0.00%  17.485us         9  1.9420us  1.5110us  2.2680us  cudaStreamSynchronize
                    0.00%  9.2380us        14     659ns     532ns  1.0440us  cudaDeviceGetStreamPriorityRange
                    0.00%  8.5240us        44     193ns     128ns     288ns  cudaPeekAtLastError
                    0.00%  3.0480us        15     203ns     140ns     315ns  cudaGetDeviceCount

This issue has been resolved. As stated above, the slow runtime appeared to be caused by CUDA API calls being made implicitly in the background, and that was indeed the case. I have now worked out where they were coming from: in the sample code I was building on top of, image inference was performed through a method that contained the inference enqueue call. Before enqueueing, an execution context must be created; this context creation was blocking, slowing the runtime, and also issuing those extra CUDA API calls in the background. Moving context creation out of the inference method and into the constructor resolved the issue.
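For what it's worth, the refactor amounts to the following pattern. This is a minimal sketch, not my actual code: `FakeEngine` and `FakeContext` are stand-ins for TensorRT's `ICudaEngine` and `IExecutionContext`, and the counters exist only to make the once-per-object behaviour visible.

```cpp
#include <memory>

// Stand-ins for nvinfer1::ICudaEngine / nvinfer1::IExecutionContext.
// In real code, createExecutionContext() is the expensive call that
// triggers the implicit CUDA API calls (cudaGetDeviceProperties, etc.)
// seen in the nvprof output above.
static int g_contextCreations = 0;
static int g_enqueues = 0;

struct FakeContext {
    FakeContext() { ++g_contextCreations; }        // expensive in real TensorRT
    bool enqueue() { ++g_enqueues; return true; }  // cheap per-inference work
};

struct FakeEngine {
    std::unique_ptr<FakeContext> createExecutionContext() {
        return std::make_unique<FakeContext>();
    }
};

// The sample code created the context inside the inference method, paying
// the creation cost on every call. The fix: create it once in the
// constructor and reuse it for every enqueue.
class Inferencer {
public:
    explicit Inferencer(FakeEngine& engine)
        : context_(engine.createExecutionContext()) {}  // pay the cost once

    bool infer() { return context_->enqueue(); }        // per-call work only

private:
    std::unique_ptr<FakeContext> context_;
};
```

With this structure, repeated `infer()` calls create the context exactly once, whereas the sample-style version would create (and tear down) one context per inference.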

Hi, I was using the TensorRT C++ API and found it ran much slower than the original PyTorch code.
I used nvprof and also found cudaLaunch consuming too much time.
I have not tried your method of moving context creation to the constructor, but I found that the sampleGooglenet C++ code creates a context and does the inference all in the same function. Its profiling results show that the cudaLaunch calls are not time-consuming at all, so I am doubting the correctness of your solution.