Deep Learning Accelerator problems

I tried to use the DLA included in the NVIDIA AGX Xavier and ran the official samples, e.g. sample_int8. At the same time, I ran trtexec to evaluate the difference between DLA and GPU mode.
However, whatever I tried, DLA mode was worse than GPU mode in both latency and accuracy.
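For context, the trtexec comparison was along these lines (a sketch; the model path is a placeholder, and flags may differ slightly between JetPack/TensorRT versions):

```shell
# GPU run: build and time an INT8 engine on the GPU (default device)
trtexec --onnx=model.onnx --int8

# DLA run: same network and precision, but placed on DLA core 0;
# --allowGPUFallback lets layers the DLA cannot run fall back to the GPU
trtexec --onnx=model.onnx --int8 --useDLACore=0 --allowGPUFallback
```

Note that layers falling back to the GPU force reformatting copies between DLA and GPU memory layouts, which shows up in the profile below as the genericReformat::copyPackedKernel launches.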
Here are the nvprof results for DLA and GPU:
DLA result:

==2310233== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   24.87%  2.3695ms       400  5.9230us  4.3520us  8.6080us  void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
                   24.44%  2.3290ms       419  5.5580us     352ns  6.1440us  [CUDA memcpy HtoD]
                   23.10%  2.2016ms       402  5.4760us  3.7760us  8.5440us  void genericReformat::copyPackedKernel<char, float, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
                   23.04%  2.1952ms       401  5.4740us  4.3840us  8.0960us  void genericReformat::copyPackedKernel<float, char, bool=1, bool=0, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const *, void*, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
                    4.30%  409.60us       400  1.0240us     864ns  1.8560us  [CUDA memcpy DtoH]
                    0.25%  24.160us        33     732ns     384ns  1.4080us  [CUDA memset]
      API calls:   43.91%  13.4667s        32  420.83ms  4.6720us  13.4662s  cudaStreamCreateWithFlags
                   18.97%  5.81622s       902  6.4481ms  3.1040us  2.63368s  cudaFree
                   18.30%  5.61057s         2  2.80528s  21.632us  5.61054s  cudaEventCreate
                   13.49%  4.13543s       870  4.7534ms  9.9520us  754.22ms  cudaMalloc
                    3.69%  1.13020s       403  2.8045ms  335.84us  15.705ms  cudaEventSynchronize
                    0.59%  179.63ms       808  222.31us  38.496us  1.8415ms  cudaMemcpy
                    0.36%  109.83ms      1203  91.294us  36.576us  618.02us  cudaLaunchKernel
                    0.09%  29.005ms       400  72.513us  30.720us  5.6223ms  cudaEGLStreamProducerPresentFrame
                    0.07%  22.210ms      1210  18.355us  3.1680us  444.19us  cudaEventRecord
                    0.05%  16.650ms       400  41.623us  22.784us  99.040us  cudaEGLStreamConsumerAcquireFrame
                    0.05%  15.849ms       400  39.621us  20.704us  142.05us  cudaEGLStreamConsumerReleaseFrame
                    0.04%  12.974ms         8  1.6218ms  8.2880us  12.846ms  cudaStreamSynchronize
                    0.04%  11.638ms       991  11.743us  3.2960us  350.37us  cudaEventCreateWithFlags
                    0.04%  11.216ms         4  2.8039ms  2.3883ms  3.1394ms  cudaHostAlloc
                    0.03%  10.007ms       449  22.286us  4.4480us  530.75us  cudaStreamDestroy
                    0.03%  9.5074ms       991  9.5930us  3.1680us  64.192us  cudaEventDestroy
                    0.03%  9.5044ms       401  23.701us  12.864us  148.19us  cudaStreamCreate
                    0.03%  7.8429ms         3  2.6143ms  61.184us  7.6939ms  cudaEGLStreamConsumerConnectWithFlags
                    0.02%  6.6940ms        33  202.85us  14.240us  2.1768ms  cudaMemsetAsync
                    0.02%  5.4356ms       400  13.588us  7.0400us  56.384us  cudaEGLStreamProducerReturnFrame
                    0.02%  5.1582ms        12  429.85us  278.05us  593.60us  cudaGetDeviceProperties
                    0.02%  4.9924ms         4  1.2481ms  1.1382ms  1.3727ms  cudaFreeHost
                    0.02%  4.9497ms       720  6.8740us  1.9840us  671.84us  cudaFuncSetAttribute
                    0.02%  4.7723ms       404  11.812us  6.0800us  95.136us  cudaEventElapsedTime
                    0.01%  4.3118ms       400  10.779us  5.0560us  338.14us  cudaGraphicsResourceGetMappedEglFrame
                    0.01%  3.3066ms        16  206.66us  5.2800us  3.0864ms  cudaStreamCreateWithPriority
                    0.01%  2.9831ms       379  7.8710us     704ns  636.45us  cuDeviceGetAttribute
                    0.01%  2.9752ms         1  2.9752ms  2.9752ms  2.9752ms  cudaEGLStreamProducerConnect
                    0.01%  2.5093ms        12  209.11us  9.2480us  1.6379ms  cudaMemcpyAsync
                    0.01%  2.4265ms        24  101.10us  4.6080us  423.74us  cudaDeviceSynchronize
                    0.01%  1.9866ms       803  2.4730us     640ns  33.184us  cudaGetLastError
                    0.01%  1.7000ms         3  566.67us  29.472us  1.5424ms  cudaEGLStreamConsumerDisconnect
                    0.00%  1.3847ms        26  53.259us  3.7120us  397.76us  cudaGetDevice
                    0.00%  1.2201ms         3  406.69us  180.93us  816.90us  cudaStreamAddCallback
                    0.00%  1.0844ms         4  271.09us  209.25us  420.96us  cuDeviceTotalMem
                    0.00%  965.79us       400  2.4140us     480ns  353.22us  cudaCreateChannelDesc
                    0.00%  821.02us       160  5.1310us  1.1520us  326.69us  cudaDeviceGetAttribute
                    0.00%  405.82us         4  101.46us  12.704us  341.63us  cudaHostGetDevicePointer
                    0.00%  340.29us         1  340.29us  340.29us  340.29us  cuGraphicsUnregisterResource
                    0.00%  331.26us         6  55.210us     960ns  326.02us  cudaRuntimeGetVersion
                    0.00%  322.11us         4  80.528us  1.2480us  317.73us  cudaGetDeviceCount
                    0.00%  287.07us         1  287.07us  287.07us  287.07us  cudaMemGetInfo
                    0.00%  257.02us         1  257.02us  257.02us  257.02us  cudaEGLStreamProducerDisconnect
                    0.00%  24.416us         6  4.0690us  1.2800us  9.0240us  cuDeviceGetCount
                    0.00%  19.968us         4  4.9920us  1.3760us  11.104us  cuDeviceGetUuid
                    0.00%  17.824us         4  4.4560us  2.4000us  6.4320us  cuDeviceGetName
                    0.00%  17.536us         4  4.3840us  3.7120us  5.7600us  cudaDeviceGetStreamPriorityRange
                    0.00%  17.120us         3  5.7060us  5.2160us  6.1760us  cuDriverGetVersion
                    0.00%  12.480us         3  4.1600us  3.2640us  4.7680us  cuInit
                    0.00%  12.352us         5  2.4700us  1.2160us  4.8640us  cuDeviceGet

==2310233== NVTX result:
==2310233==   Thread "<unnamed>" (id = 1)
==2310233==     Domain "TensorRT"
==2310233==       Range "ExecutionContext::enqueue"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  715.94ms       400  1.7899ms  1.3825ms  16.730ms  ExecutionContext::enqueue
 GPU activities:   35.14%  2.3695ms       400  5.9230us  4.3520us  8.6080us  void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
                   32.44%  2.1872ms       400  5.4680us  4.3840us  8.0960us  void genericReformat::copyPackedKernel<float, char, bool=1, bool=0, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const *, void*, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
                   32.42%  2.1855ms       400  5.4630us  3.7760us  8.5440us  void genericReformat::copyPackedKernel<char, float, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
      API calls:  100.00%  108.87ms      1200  90.723us  36.576us  180.93us  cudaLaunchKernel

==2310233==       Range "data copy finish"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  20.777ms       400  51.942us  30.848us  104.54us  data copy finish
No kernels were profiled in this range.
No API activities were profiled in this range.

==2310233==       Range "data to nvm"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  58.441ms       400  146.10us  63.360us  245.12us  data to nvm
 GPU activities:  100.00%  2.1872ms       400  5.4680us  4.3840us  8.0960us  void genericReformat::copyPackedKernel<float, char, bool=1, bool=0, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const *, void*, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
      API calls:  100.00%  39.559ms       400  98.897us  42.912us  180.93us  cudaLaunchKernel

==2310233==       Range "prob"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  48.300ms       400  120.75us  52.960us  1.8591ms  prob
 GPU activities:  100.00%  2.3695ms       400  5.9230us  4.3520us  8.6080us  void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
      API calls:  100.00%  32.333ms       400  80.833us  36.576us  135.17us  cudaLaunchKernel

==2310233==       Range "{conv1,pool1,conv2,pool2,ip1,relu1,ip2}"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  445.11ms       400  1.1128ms  865.89us  11.927ms  {conv1,pool1,conv2,pool2,ip1,relu1,ip2}
No kernels were profiled in this range.
No API activities were profiled in this range.

==2310233==       Range "{conv1,pool1,conv2,pool2,ip1,relu1,ip2} output reformatter 0"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  100.04ms       400  250.11us  132.61us  2.4233ms  {conv1,pool1,conv2,pool2,ip1,relu1,ip2} output reformatter 0
 GPU activities:  100.00%  2.1855ms       400  5.4630us  3.7760us  8.5440us  void genericReformat::copyPackedKernel<char, float, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
      API calls:  100.00%  36.975ms       400  92.438us  43.968us  168.58us  cudaLaunchKernel

==2310233==       Range "{conv1,pool1,conv2,pool2,ip1,relu1,ip2} output to be reformatted 0 finish"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  24.482ms       400  61.204us  32.608us  213.57us  {conv1,pool1,conv2,pool2,ip1,relu1,ip2} output to be reformatted 0 finish
No kernels were profiled in this range.
No API activities were profiled in this range.

GPU result:

==2330713== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   27.06%  20.411ms       402  50.774us  28.800us  54.177us  trt_volta_fp32_icudnn_int8x4_128x32_relu_interior_nn_v1
                   17.07%  12.877ms       402  32.031us  30.528us  35.232us  void gemmSN_NN_kernel<float, int=256, int=4, int=2, int=8, int=4, int=4, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
                   14.24%  10.739ms       402  26.714us  22.912us  39.456us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_small_nn_v1
                    7.15%  5.3972ms       404  13.359us  5.7280us  28.928us  void gemmSN_TN_kernel<float, int=128, int=16, int=2, int=4, int=4, int=4, bool=1, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
                    3.55%  2.6801ms      1048  2.5570us     320ns  160.23us  [CUDA memcpy HtoD]
                    3.55%  2.6772ms       400  6.6920us  5.8240us  9.1520us  void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
                    3.53%  2.6618ms       402  6.6210us  5.8880us  8.0640us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    3.13%  2.3620ms       814  2.9010us  1.7280us  7.5840us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
                    2.83%  2.1353ms       402  5.3110us  4.7040us  7.2960us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    1.94%  1.4667ms       405  3.6210us  3.1680us  6.3360us  cuInt8::nchwToNcqhw4(float const *, unsigned int*, int, int, int, int, int, int, int, float const *, cuInt8::ReducedDivisorParameters)
                    1.77%  1.3321ms       404  3.2970us  2.7200us  4.4800us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=1, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
                    1.29%  976.80us       187  5.2230us  1.2800us  82.145us  [CUDA memcpy DtoD]
                    0.44%  334.30us       138  2.4220us  1.4400us  4.5120us  cask::computeOffsetsKernel(cask::ComputeOffsetsParams)
                    0.43%  323.84us         2  161.92us  129.70us  194.15us  trt_volta_sgemm_128x128_relu_nn_v1
                    0.38%  287.71us       401     717ns     416ns  64.705us  [CUDA memcpy DtoH]
                    0.27%  204.99us         2  102.50us  85.440us  119.55us  trt_volta_sgemm_64x64_relu_nn_v1
                    0.26%  198.15us         2  99.073us  77.505us  120.64us  trt_volta_sgemm_128x64_relu_nn_v1
                    0.26%  197.63us         2  98.816us  61.344us  136.29us  trt_volta_scudnn_128x128_relu_xregs_large_nn_v1
                    0.24%  183.78us         2  91.888us  48.928us  134.85us  trt_volta_scudnn_128x128_relu_medium_nn_v1
                    0.24%  183.71us         2  91.856us  49.344us  134.37us  trt_volta_scudnn_128x128_relu_interior_nn_v1
                    0.24%  183.26us         2  91.632us  47.168us  136.10us  trt_volta_scudnn_128x128_relu_small_nn_v1
                    0.24%  182.56us       308     592ns     288ns  1.6320us  [CUDA memset]
                    0.19%  145.31us        96  1.5130us  1.2480us  2.8480us  cask::reorderImma8816Bias(cask::ReorderImma8816BiasParams)
                    0.19%  140.67us         2  70.336us  39.136us  101.54us  trt_volta_sgemm_32x128_relu_nn_v1
                    0.17%  129.63us         2  64.817us  49.345us  80.289us  trt_volta_int8_i8816cudnn_int8_256x128_ldg16_relu_large_nt_v1
                    0.17%  128.70us         2  64.352us  33.792us  94.912us  trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
                    0.17%  128.06us         2  64.032us  38.112us  89.952us  trt_volta_sgemm_128x32_relu_nn_v1
                    0.17%  126.56us         2  63.280us  33.312us  93.248us  trt_volta_scudnn_128x64_relu_medium_nn_v1
                    0.17%  125.54us         2  62.768us  30.784us  94.752us  trt_volta_scudnn_128x64_relu_interior_nn_v1
                    0.16%  123.58us         2  61.792us  34.272us  89.312us  trt_volta_scudnn_128x64_relu_small_nn_v1
                    0.16%  122.31us         2  61.152us  45.665us  76.640us  trt_volta_int8_i8816cudnn_int8_256x128_ldg16_relu_medium_nt_v1
                    0.16%  120.03us         2  60.016us  45.312us  74.720us  trt_volta_int8_i8816cudnn_int8_256x128_ldg16_relu_small_nt_v1
                    0.16%  119.36us         2  59.680us  44.192us  75.168us  trt_volta_int8_i8816cudnn_int8_256x128_ldg16_relu_interior_nt_v1
                    0.16%  119.17us         2  59.584us  26.144us  93.025us  trt_volta_scudnn_128x32_relu_small_nn_v1
                    0.16%  118.31us         2  59.152us  56.033us  62.272us  trt_volta_fp32_icudnn_int8x4_128x128_relu_xregs_large_nn_v1
                    0.15%  116.67us         2  58.336us  26.112us  90.560us  trt_volta_scudnn_128x32_relu_interior_nn_v1
                    0.15%  116.03us         2  58.016us  30.208us  85.824us  volta_scudnn_128x64_relu_interior_nn_v1
                    0.15%  115.87us         2  57.936us  43.296us  72.576us  trt_volta_int8_i8816cudnn_int8_128x128_ldg16_relu_large_nt_v1
                    0.15%  110.88us        32  3.4650us  2.2400us  5.5360us  cask::reorderImma8816Filter(cask::ReorderImma8816FilterParams)
                    0.15%  110.08us         2  55.040us  53.792us  56.288us  trt_volta_int8x4_icudnn_int8x4_128x128_relu_xregs_large_nn_v1
                    0.14%  107.71us         2  53.856us  51.840us  55.873us  trt_volta_int8x4_icudnn_int8x4_128x128_relu_xregs_large_c32_nn_v1
                    0.14%  107.52us         2  53.760us  48.928us  58.592us  trt_volta_fp32_icudnn_int8x4_128x128_relu_medium_nn_v1
                    0.14%  104.00us         2  52.000us  41.248us  62.753us  trt_volta_int8_i8816cudnn_int8_128x128_ldg16_relu_small_nt_v1
                    0.14%  101.98us         2  50.992us  49.120us  52.864us  trt_volta_fp32_icudnn_int8x4_128x128_relu_interior_nn_v1
                    0.13%  101.44us         2  50.720us  47.841us  53.600us  trt_volta_fp32_icudnn_int8x4_128x128_relu_small_nn_v1
                    0.13%  98.144us         2  49.072us  35.488us  62.656us  trt_volta_int8_i8816cudnn_int8_128x128_ldg16_relu_interior_nt_v1
                    0.13%  98.112us         2  49.056us  35.616us  62.496us  trt_volta_int8_i8816cudnn_int8_128x128_ldg16_relu_medium_nt_v1
                    0.13%  97.344us         2  48.672us  47.872us  49.472us  trt_volta_int8x4_icudnn_int8x4_128x128_relu_small_c32_nn_v1
                    0.13%  96.992us         2  48.496us  43.232us  53.760us  trt_volta_int8x4_icudnn_int8x4_128x128_relu_small_nn_v1
                    0.13%  96.640us         2  48.320us  44.224us  52.416us  trt_volta_int8x4_icudnn_int8x4_128x128_relu_interior_c32_nn_v1
                    0.13%  96.352us         2  48.176us  45.184us  51.168us  trt_volta_int8x4_icudnn_int8x4_128x128_relu_medium_nn_v1
                    0.13%  96.128us         2  48.064us  44.480us  51.648us  trt_volta_int8x4_icudnn_int8x4_128x128_relu_medium_c32_nn_v1
                    0.13%  94.304us         2  47.152us  44.256us  50.048us  trt_volta_int8x4_icudnn_int8x4_128x128_relu_interior_nn_v1
                    0.12%  87.137us         2  43.568us  38.144us  48.993us  trt_volta_fp32_icudnn_int8x4_128x64_relu_medium_nn_v1
                    0.11%  86.753us         2  43.376us  36.033us  50.720us  trt_volta_fp32_icudnn_int8x4_128x32_relu_xregs_small_nn_v1
                    0.11%  86.496us         2  43.248us  21.248us  65.248us  void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
                    0.11%  84.416us         2  42.208us  40.192us  44.224us  trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_large_nt_v1
                    0.11%  84.384us         2  42.192us  34.848us  49.536us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_medium_c32_nn_v1
                    0.11%  84.192us         2  42.096us  39.808us  44.384us  trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_large_nt_v1
                    0.11%  83.776us         2  41.888us  35.264us  48.512us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_small_nn_v1
                    0.11%  83.392us         2  41.696us  35.456us  47.936us  trt_volta_fp32_icudnn_int8x4_128x32_relu_xregs_medium_nn_v1
                    0.11%  83.137us         2  41.568us  40.449us  42.688us  trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_medium_nt_v1
                    0.11%  82.304us         2  41.152us  35.552us  46.752us  trt_volta_fp32_icudnn_int8x4_128x32_relu_xregs_interior_nn_v1
                    0.11%  82.176us         2  41.088us  36.992us  45.184us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_small_c32_nn_v1
                    0.11%  82.144us         2  41.072us  32.000us  50.144us  trt_volta_int8x4_icudnn_int8x4_128x64_relu_medium_nn_v1
                    0.11%  81.920us         2  40.960us  40.192us  41.728us  trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_medium_nt_v1
                    0.11%  81.472us         2  40.736us  38.880us  42.592us  trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_small_nt_v1
                    0.11%  80.928us         2  40.464us  33.344us  47.584us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_medium_nn_v1
                    0.11%  80.064us         2  40.032us  38.656us  41.408us  trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_small_nt_v1
                    0.11%  79.456us         2  39.728us  32.992us  46.464us  trt_volta_fp32_icudnn_int8x4_128x64_relu_xregs_large_nn_v1
                    0.11%  79.265us         2  39.632us  39.233us  40.032us  trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_interior_nt_v1
                    0.10%  78.816us         2  39.408us  34.720us  44.096us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_interior_c32_nn_v1
                    0.10%  78.176us         2  39.088us  33.280us  44.896us  trt_volta_int8x4_icudnn_int8x4_128x64_relu_medium_c32_nn_v1
                    0.10%  77.920us         2  38.960us  32.768us  45.152us  trt_volta_fp32_icudnn_int8x4_128x64_relu_small_nn_v1
                    0.10%  75.968us         2  37.984us  36.480us  39.488us  trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_interior_nt_v1
                    0.10%  75.712us         2  37.856us  33.152us  42.560us  trt_volta_fp32_icudnn_int8x4_128x64_relu_interior_nn_v1
                    0.10%  75.712us         2  37.856us  34.176us  41.536us  trt_volta_int8x4_icudnn_int8x4_128x64_relu_xregs_large_nn_v1
                    0.10%  75.200us         2  37.600us  35.488us  39.712us  trt_volta_int8x4_icudnn_int8x4_128x64_relu_interior_nn_v1
                    0.10%  74.112us         2  37.056us  9.8560us  64.256us  void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const *, float*, int)
                    0.10%  73.761us         2  36.880us  31.905us  41.856us  trt_volta_int8x4_icudnn_int8x4_128x64_relu_xregs_large_c32_nn_v1
                    0.10%  73.312us         2  36.656us  31.424us  41.888us  trt_volta_int8x4_icudnn_int8x4_128x64_relu_small_nn_v1
                    0.10%  73.282us         2  36.641us  27.137us  46.145us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_medium_nn_v1
                    0.10%  72.384us         2  36.192us  31.104us  41.280us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_interior_nn_v1
                    0.09%  71.616us         2  35.808us  28.320us  43.296us  trt_volta_fp32_icudnn_int8x4_128x32_relu_medium_nn_v1
                    0.09%  71.200us         2  35.600us  27.584us  43.616us  trt_volta_fp32_icudnn_int8x4_128x32_relu_small_nn_v1
                    0.09%  70.433us         2  35.216us  29.952us  40.481us  trt_volta_int8x4_icudnn_int8x4_128x64_relu_small_c32_nn_v1
                    0.09%  68.993us         2  34.496us  30.144us  38.849us  trt_volta_int8x4_icudnn_int8x4_128x64_relu_interior_c32_nn_v1
                    0.09%  68.640us         2  34.320us  25.728us  42.912us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_medium_c32_nn_v1
                    0.09%  65.888us         2  32.944us  28.480us  37.408us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_small_c32_nn_v1
                    0.09%  65.824us         1  65.824us  65.824us  65.824us  trt_volta_int8_igemm_int8_128x128_ldg4_relu_nn_v0
                    0.09%  65.728us         2  32.864us  26.848us  38.880us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_interior_c32_nn_v1
                    0.08%  63.456us         2  31.728us  26.496us  36.960us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_interior_nn_v1
                    0.08%  59.072us         1  59.072us  59.072us  59.072us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    0.07%  54.208us         4  13.552us  8.8960us  23.552us  void genericReformat::copyPackedKernel<float, char, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
                    0.05%  39.777us         4  9.9440us  8.0010us  14.336us  void genericReformat::copyPackedKernel<char, float, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
                    0.05%  37.248us         2  18.624us  15.328us  21.920us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=8, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.05%  34.080us         2  17.040us  16.032us  18.048us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=7, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.04%  32.864us         2  16.432us  13.280us  19.584us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=7, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.04%  32.224us         2  16.112us  14.880us  17.344us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=8, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.04%  29.312us         5  5.8620us  4.1600us  10.720us  cuInt8::ncqhw4ToNchw(char const *, float*, int, int, int, int, int, int, float const *, cuInt8::ReducedDivisorParameters)
                    0.04%  27.776us         2  13.888us  13.632us  14.144us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=6, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.04%  27.616us         2  13.808us  10.912us  16.704us  void gemmSN_NN_kernel<float, int=128, int=2, int=4, int=8, int=4, int=4, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
                    0.04%  26.656us         2  13.328us  12.928us  13.728us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=5, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.03%  24.896us         2  12.448us  11.616us  13.280us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=6, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.03%  24.833us         2  12.416us  11.488us  13.345us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.03%  24.384us         1  24.384us  24.384us  24.384us  void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
                    0.03%  23.808us         2  11.904us  11.392us  12.416us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=4, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.03%  21.920us         2  10.960us  10.176us  11.744us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=1, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.03%  21.888us         2  10.944us  10.592us  11.296us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=4, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.03%  21.344us         2  10.672us  10.016us  11.328us  void cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>(cudnnTensorStruct, float const *, cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>, cudnnTensorStruct*, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
                    0.03%  19.296us         4  4.8240us  4.0960us  6.0160us  cuInt8::nc32hw32ToNcqhw4(char4 const *, char4*, nvinfer1::rt::reduced_divisor, int, nvinfer1::rt::reduced_divisor, nvinfer1::rt::reduced_divisor, int, int)
                    0.02%  18.848us         2  9.4240us  8.9280us  9.9200us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=3, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  17.920us         2  8.9600us  8.3200us  9.6000us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=3, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  17.024us         2  8.5120us  8.2240us  8.8000us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=6, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  16.064us         2  8.0320us  7.3920us  8.6720us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=3, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  15.808us         2  7.9040us  7.8720us  7.9360us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=8, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  15.617us         2  7.8080us  7.5210us  8.0960us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=2, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  15.488us         2  7.7440us  7.7440us  7.7440us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=2, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  15.264us         2  7.6320us  7.2320us  8.0320us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=8, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  14.912us         2  7.4560us  6.5920us  8.3200us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=1, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  14.816us         2  7.4080us  6.4960us  8.3200us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  14.752us         2  7.3760us  7.2320us  7.5200us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=3, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  14.688us         2  7.3440us  6.3680us  8.3200us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=7, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  14.560us         2  7.2800us  7.1360us  7.4240us  void nvinfer1::poolCHWPackedInt8Quad<nvinfer1::PoolingType>(char4 const *, nvinfer1::poolCHWPackedInt8Quad<nvinfer1::PoolingType>*, int, int, int, int, int, nvinfer1::rt::reduced_divisor, nvinfer1::rt, int, int, int, int, int, int, float, float4 const *, float4 const , int, int)
                    0.02%  14.432us         2  7.2160us  6.5280us  7.9040us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=6, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  14.432us         2  7.2160us  6.7200us  7.7120us  void nvinfer1::poolNCxHWxInt8<nvinfer1::PoolingType>(nvinfer1::IMMAInt8PackedArray const *, nvinfer1::poolNCxHWxInt8<nvinfer1::PoolingType>*, int, int, nvinfer1::rt::reduced_divisor, int, int, int, nvinfer1::rt, int, int, int, int, int, int, nvinfer1::rt, float, float, nvinfer1::IMMAFloatPackedArray const *, nvinfer1::IMMAFloatPackedArray const , int, int)
                    0.02%  14.304us         2  7.1520us  6.0480us  8.2560us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=7, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  14.240us         2  7.1200us  7.1040us  7.1360us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=4, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  13.888us         2  6.9440us  6.6560us  7.2320us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=1, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.02%  13.632us         3  4.5440us  3.9680us  5.4720us  cuInt8::ncqhw4ToNc32hw32(char4 const *, char4*, nvinfer1::rt::reduced_divisor, int, nvinfer1::rt::reduced_divisor, nvinfer1::rt::reduced_divisor, int, int)
                    0.02%  13.312us         2  6.6560us  6.3040us  7.0080us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=1, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.01%  11.200us         2  5.6000us  5.4720us  5.7280us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=4, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.01%  10.784us         2  5.3920us  5.1200us  5.6640us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    0.01%  6.4640us         2  3.2320us  3.1680us  3.2960us  cudnn::gemm::computeOffsetsKernel(cudnn::gemm::ComputeOffsetsParams)
                    0.01%  4.4160us         1  4.4160us  4.4160us  4.4160us  cuInt8::ncqhw4ToNchw(char const *, char*, int, int, int, int, int, int, cuInt8::ReducedDivisorParameters)
      API calls:   44.34%  30.8231s      1397  22.064ms  10.560us  748.31ms  cudaMalloc
                   25.77%  17.9166s      1429  12.538ms  3.0080us  2.66518s  cudaFree
                   19.50%  13.5546s        32  423.58ms  4.6720us  13.5541s  cudaStreamCreateWithFlags
                    8.11%  5.63520s         2  2.81760s  21.280us  5.63518s  cudaEventCreate
                    1.48%  1.03069s       656  1.5712ms  4.0000us  15.928ms  cudaEventSynchronize
                    0.32%  223.54ms      4945  45.205us  21.440us  966.66us  cudaLaunchKernel
                    0.09%  60.993ms       828  73.662us  14.368us  1.5800ms  cudaMemcpyAsync
                    0.07%  49.828ms       808  61.668us  22.944us  1.6152ms  cudaMemcpy
                    0.05%  37.075ms       256  144.82us  73.120us  874.43us  cudaStreamAddCallback
                    0.05%  36.099ms       308  117.21us  18.656us  1.4974ms  cudaMemsetAsync
                    0.04%  26.018ms      1716  15.161us  2.4320us  520.42us  cudaEventRecord
                    0.02%  12.971ms      1218  10.649us  7.7440us  55.968us  cudaFuncGetAttributes
                    0.02%  12.971ms        11  1.1791ms  5.7600us  12.812ms  cudaStreamSynchronize
                    0.02%  10.639ms         4  2.6598ms  2.3662ms  2.9057ms  cudaHostAlloc
                    0.01%  8.8367ms       993  8.8980us  3.0720us  85.376us  cudaEventDestroy
                    0.01%  8.4672ms       993  8.5260us  3.2320us  371.97us  cudaEventCreateWithFlags
                    0.01%  7.9076ms      5032  1.5710us     992ns  506.88us  cudaDeviceGetAttribute
                    0.01%  6.4732ms       449  14.416us  4.3840us  507.62us  cudaStreamDestroy
                    0.01%  6.0601ms       401  15.112us  8.2560us  102.14us  cudaStreamCreate
                    0.01%  6.0570ms      5322  1.1380us     416ns  407.30us  cudaGetLastError
                    0.01%  5.5872ms      1218  4.5870us  2.5280us  417.86us  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.01%  5.2562ms        15  350.41us  226.34us  625.79us  cudaGetDeviceProperties
                    0.01%  5.0499ms       657  7.6860us  2.8480us  52.256us  cudaEventElapsedTime
                    0.01%  4.6349ms         4  1.1587ms  1.0953ms  1.2072ms  cudaFreeHost
                    0.01%  4.5369ms       720  6.3010us  1.8880us  784.00us  cudaFuncSetAttribute
                    0.01%  4.3624ms      1247  3.4980us  1.4400us  534.59us  cudaGetDevice
                    0.00%  3.4085ms        16  213.03us  4.9280us  3.1916ms  cudaStreamCreateWithPriority
                    0.00%  3.1423ms       379  8.2910us     704ns  852.16us  cuDeviceGetAttribute
                    0.00%  1.6859ms         6  280.98us  13.952us  1.5688ms  cudaBindTexture
                    0.00%  1.1122ms         4  278.06us  227.52us  403.87us  cuDeviceTotalMem
                    0.00%  770.21us        24  32.092us  4.2560us  501.22us  cudaDeviceSynchronize
                    0.00%  607.04us         4  151.76us  9.6640us  542.98us  cudaHostGetDevicePointer
                    0.00%  406.18us        12  33.848us     864ns  317.89us  cudaRuntimeGetVersion
                    0.00%  309.06us         4  77.264us  1.1840us  305.34us  cudaGetDeviceCount
                    0.00%  304.38us         1  304.38us  304.38us  304.38us  cudaMemGetInfo
                    0.00%  90.912us         6  15.152us  2.6560us  38.688us  cudaUnbindTexture
                    0.00%  40.000us         6  6.6660us  1.3440us  23.040us  cuDeviceGetCount
                    0.00%  38.272us         4  9.5680us  3.2640us  27.424us  cudaDeviceGetStreamPriorityRange
                    0.00%  18.752us         5  3.7500us  1.1840us  11.584us  cuDeviceGet
                    0.00%  17.312us         3  5.7700us  3.4560us  9.4080us  cuDriverGetVersion
                    0.00%  13.856us         3  4.6180us  3.9680us  5.4720us  cuInit
                    0.00%  12.192us         4  3.0480us  2.3680us  4.5440us  cuDeviceGetName
                    0.00%  10.240us         4  2.5600us  1.7600us  4.4160us  cuDeviceGetUuid

==2330713== NVTX result:
==2330713==   Thread "<unnamed>" (id = 1)
==2330713==     Domain "TensorRT"
==2330713==       Range "ExecutionContext::enqueue"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  277.07ms       400  692.67us  444.99us  2.6562ms  ExecutionContext::enqueue
 GPU activities:   33.00%  20.342ms       400  50.855us  49.056us  54.177us  trt_volta_fp32_icudnn_int8x4_128x32_relu_interior_nn_v1
                   20.78%  12.808ms       400  32.019us  30.528us  35.232us  void gemmSN_NN_kernel<float, int=256, int=4, int=2, int=8, int=4, int=4, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
                   17.31%  10.673ms       400  26.683us  22.912us  32.192us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_small_nn_v1
                    8.63%  5.3177ms       400  13.294us  12.096us  15.456us  void gemmSN_TN_kernel<float, int=128, int=16, int=2, int=4, int=4, int=4, bool=1, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
                    4.34%  2.6772ms       400  6.6920us  5.8240us  9.1520us  void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
                    4.29%  2.6475ms       400  6.6180us  5.8880us  8.0640us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    3.73%  2.2989ms       800  2.8730us  1.9840us  5.3760us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
                    3.45%  2.1248ms       400  5.3120us  4.7040us  7.2960us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
                    2.34%  1.4401ms       400  3.6000us  3.1680us  5.1840us  cuInt8::nchwToNcqhw4(float const *, unsigned int*, int, int, int, int, int, int, int, float const *, cuInt8::ReducedDivisorParameters)
                    2.14%  1.3175ms       400  3.2930us  3.1040us  4.4800us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=1, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
      API calls:  100.00%  174.32ms      4400  39.618us  21.440us  203.87us  cudaLaunchKernel

==2330713==       Range "conv1"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  23.534ms       400  58.835us  32.928us  276.90us  conv1
 GPU activities:  100.00%  10.673ms       400  26.683us  22.912us  32.192us  trt_volta_int8x4_icudnn_int8x4_128x32_relu_small_nn_v1
      API calls:  100.00%  17.002ms       400  42.506us  23.584us  151.58us  cudaLaunchKernel

==2330713==       Range "conv1 input reformatter 0"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  28.516ms       400  71.291us  40.576us  253.15us  conv1 input reformatter 0
 GPU activities:  100.00%  1.4401ms       400  3.6000us  3.1680us  5.1840us  cuInt8::nchwToNcqhw4(float const *, unsigned int*, int, int, int, int, int, int, int, float const *, cuInt8::ReducedDivisorParameters)
      API calls:  100.00%  20.209ms       400  50.522us  29.184us  178.69us  cudaLaunchKernel

==2330713==       Range "conv2"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  21.559ms       400  53.896us  32.480us  192.19us  conv2
 GPU activities:  100.00%  20.342ms       400  50.855us  49.056us  54.177us  trt_volta_fp32_icudnn_int8x4_128x32_relu_interior_nn_v1
      API calls:  100.00%  16.013ms       400  40.032us  23.616us  181.41us  cudaLaunchKernel

==2330713==       Range "ip1 + relu1"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  83.385ms       400  208.46us  139.17us  536.70us  ip1 + relu1
 GPU activities:   82.45%  12.808ms       400  32.019us  30.528us  35.232us  void gemmSN_NN_kernel<float, int=256, int=4, int=2, int=8, int=4, int=4, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
                    9.07%  1.4093ms       400  3.5230us  2.9120us  5.3760us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
                    8.48%  1.3175ms       400  3.2930us  3.1040us  4.4800us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=1, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
      API calls:  100.00%  46.532ms      1200  38.777us  22.880us  171.90us  cudaLaunchKernel

==2330713==       Range "ip2"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  45.353ms       400  113.38us  81.280us  299.84us   ip2
 GPU activities:   85.67%  5.3177ms       400  13.294us  12.096us  15.456us  void gemmSN_TN_kernel<float, int=128, int=16, int=2, int=4, int=4, int=4, bool=1, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
                   14.33%  889.60us       400  2.2240us  1.9840us  3.3920us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
      API calls:  100.00%  28.028ms       800  35.034us  21.440us  191.42us  cudaLaunchKernel

==2330713==       Range "pool1"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  21.144ms       400  52.860us  30.528us  168.48us  pool1
 GPU activities:  100.00%  2.6475ms       400  6.6180us  5.8880us  8.0640us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
      API calls:  100.00%  16.329ms       400  40.822us  22.944us  128.03us  cudaLaunchKernel

==2330713==       Range "pool2"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  19.760ms       400  49.400us  30.112us  216.64us  pool2
 GPU activities:  100.00%  2.1248ms       400  5.3120us  4.7040us  7.2960us  void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
      API calls:  100.00%  15.744ms       400  39.360us  22.944us  203.87us  cudaLaunchKernel

==2330713==       Range "prob"
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
          Range:  100.00%  20.761ms       400  51.902us  31.104us  1.6653ms  prob
 GPU activities:  100.00%  2.6772ms       400  6.6920us  5.8240us  9.1520us  void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
      API calls:  100.00%  14.465ms       400  36.162us  22.528us  140.45us  cudaLaunchKernel
Others have asked similar questions on the forum, for example:

https://devtalk.nvidia.com/default/topic/1056139/general/-after-using-dla-the-speed-is-slower/
https://devtalk.nvidia.com/default/topic/1047422/general/decreased-performance-from-fp16-to-int8-in-tf-trt-on-jetson-xavier/post/5315733/#5315733
https://devtalk.nvidia.com/default/topic/1051679/jetson-agx-xavier/tensorrt-log-info-about-nvdla-/post/5338061/#5338061

The posts above also report that DLA is slower than the GPU, but I can't find a clear solution or explanation in them.
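For reference, a DLA-vs-GPU comparison like the one above can be reproduced with trtexec along these lines. The model path is a placeholder; note that DLA only runs in INT8 or FP16 precision, so the precision flag is required:

```shell
# GPU INT8 baseline (model.onnx is a placeholder for your network)
trtexec --onnx=model.onnx --int8

# Same model on DLA core 0, letting unsupported layers fall back to the GPU
trtexec --onnx=model.onnx --int8 --useDLACore=0 --allowGPUFallback
```

Without `--allowGPUFallback`, engine building fails as soon as the network contains a layer DLA cannot run.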

Here are my questions:

  1. Is DLA really slower than the GPU? If so, what is DLA actually for?
  2. In the log above, DLA fused the neural network layers, e.g. "{conv1,pool1,conv2,pool2,ip1,relu1,ip2}". How does this fusion work, and can it be split?
  3. How can I analyze or monitor DLA while the program is running on it? I can't see it through nvprof.

Thanks.

Dear cwcai,
DLA is a separate accelerator alongside the GPU. Note that a single DLA offers 5 INT8 TOPS, whereas the integrated GPU offers 20 INT8 TOPS, so the GPU has roughly 4x the peak throughput.
Currently DLA does not support all layers. Please check https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#dla_layers for the list of supported layers. Unsupported layers can fall back to the GPU.
Layer fusion is one of the optimizations performed by TensorRT.
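Regarding question 2, the placement is also controllable per layer through the TensorRT builder configuration, so the fused group can be split manually. A minimal sketch using the `IBuilderConfig` interface (in older TensorRT releases the equivalent methods live on `IBuilder` instead); `builder`, `config`, and `network` are assumed to exist already:

```cpp
#include "NvInfer.h"

// Sketch: run the network on DLA by default, fall back to the GPU for
// unsupported layers, and show how individual layers can be re-assigned.
void configureDla(nvinfer1::IBuilderConfig* config,
                  nvinfer1::INetworkDefinition* network)
{
    using namespace nvinfer1;

    config->setDefaultDeviceType(DeviceType::kDLA); // prefer DLA for all layers
    config->setDLACore(0);                          // Xavier has two DLA cores: 0 and 1
    config->setFlag(BuilderFlag::kGPU_FALLBACK);    // unsupported layers go to the GPU
    config->setFlag(BuilderFlag::kINT8);            // DLA requires INT8 or FP16

    // Per-layer placement: query DLA support layer by layer and pin
    // anything unsupported to the GPU explicitly.
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        ILayer* layer = network->getLayer(i);
        if (!config->canRunOnDLA(layer))
            config->setDeviceType(layer, DeviceType::kGPU);
    }
}
```

Every device switch inserts a reformat/copy between the engines, which is one reason a small network like this can run slower on DLA than purely on the GPU.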