I tried to use the DLA included in the NVIDIA AGX Xavier and ran the official samples (e.g. sample_int8). At the same time, I ran trtexec to compare DLA and GPU modes.
However, no matter what I tried, DLA mode was worse than GPU mode in both latency and accuracy.
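For reference, this is roughly how the two modes can be compared with trtexec. The model file and output name below are placeholders, not my exact command; `--useDLACore` and `--allowGPUFallback` are the flags that offload the network to a DLA core and let unsupported layers fall back to the GPU:

```shell
# DLA run: INT8 on DLA core 0, with GPU fallback for layers DLA cannot run
trtexec --deploy=mnist.prototxt --output=prob \
        --int8 --useDLACore=0 --allowGPUFallback

# GPU-only run with the same precision, for comparison
trtexec --deploy=mnist.prototxt --output=prob --int8
```

Comparing the reported latencies of these two invocations (and profiling each under nvprof) is how I obtained the results below.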
Here are the nvprof results for DLA and GPU:
DLA result:
==2310233== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 24.87% 2.3695ms 400 5.9230us 4.3520us 8.6080us void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
24.44% 2.3290ms 419 5.5580us 352ns 6.1440us [CUDA memcpy HtoD]
23.10% 2.2016ms 402 5.4760us 3.7760us 8.5440us void genericReformat::copyPackedKernel<char, float, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
23.04% 2.1952ms 401 5.4740us 4.3840us 8.0960us void genericReformat::copyPackedKernel<float, char, bool=1, bool=0, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const *, void*, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
4.30% 409.60us 400 1.0240us 864ns 1.8560us [CUDA memcpy DtoH]
0.25% 24.160us 33 732ns 384ns 1.4080us [CUDA memset]
API calls: 43.91% 13.4667s 32 420.83ms 4.6720us 13.4662s cudaStreamCreateWithFlags
18.97% 5.81622s 902 6.4481ms 3.1040us 2.63368s cudaFree
18.30% 5.61057s 2 2.80528s 21.632us 5.61054s cudaEventCreate
13.49% 4.13543s 870 4.7534ms 9.9520us 754.22ms cudaMalloc
3.69% 1.13020s 403 2.8045ms 335.84us 15.705ms cudaEventSynchronize
0.59% 179.63ms 808 222.31us 38.496us 1.8415ms cudaMemcpy
0.36% 109.83ms 1203 91.294us 36.576us 618.02us cudaLaunchKernel
0.09% 29.005ms 400 72.513us 30.720us 5.6223ms cudaEGLStreamProducerPresentFrame
0.07% 22.210ms 1210 18.355us 3.1680us 444.19us cudaEventRecord
0.05% 16.650ms 400 41.623us 22.784us 99.040us cudaEGLStreamConsumerAcquireFrame
0.05% 15.849ms 400 39.621us 20.704us 142.05us cudaEGLStreamConsumerReleaseFrame
0.04% 12.974ms 8 1.6218ms 8.2880us 12.846ms cudaStreamSynchronize
0.04% 11.638ms 991 11.743us 3.2960us 350.37us cudaEventCreateWithFlags
0.04% 11.216ms 4 2.8039ms 2.3883ms 3.1394ms cudaHostAlloc
0.03% 10.007ms 449 22.286us 4.4480us 530.75us cudaStreamDestroy
0.03% 9.5074ms 991 9.5930us 3.1680us 64.192us cudaEventDestroy
0.03% 9.5044ms 401 23.701us 12.864us 148.19us cudaStreamCreate
0.03% 7.8429ms 3 2.6143ms 61.184us 7.6939ms cudaEGLStreamConsumerConnectWithFlags
0.02% 6.6940ms 33 202.85us 14.240us 2.1768ms cudaMemsetAsync
0.02% 5.4356ms 400 13.588us 7.0400us 56.384us cudaEGLStreamProducerReturnFrame
0.02% 5.1582ms 12 429.85us 278.05us 593.60us cudaGetDeviceProperties
0.02% 4.9924ms 4 1.2481ms 1.1382ms 1.3727ms cudaFreeHost
0.02% 4.9497ms 720 6.8740us 1.9840us 671.84us cudaFuncSetAttribute
0.02% 4.7723ms 404 11.812us 6.0800us 95.136us cudaEventElapsedTime
0.01% 4.3118ms 400 10.779us 5.0560us 338.14us cudaGraphicsResourceGetMappedEglFrame
0.01% 3.3066ms 16 206.66us 5.2800us 3.0864ms cudaStreamCreateWithPriority
0.01% 2.9831ms 379 7.8710us 704ns 636.45us cuDeviceGetAttribute
0.01% 2.9752ms 1 2.9752ms 2.9752ms 2.9752ms cudaEGLStreamProducerConnect
0.01% 2.5093ms 12 209.11us 9.2480us 1.6379ms cudaMemcpyAsync
0.01% 2.4265ms 24 101.10us 4.6080us 423.74us cudaDeviceSynchronize
0.01% 1.9866ms 803 2.4730us 640ns 33.184us cudaGetLastError
0.01% 1.7000ms 3 566.67us 29.472us 1.5424ms cudaEGLStreamConsumerDisconnect
0.00% 1.3847ms 26 53.259us 3.7120us 397.76us cudaGetDevice
0.00% 1.2201ms 3 406.69us 180.93us 816.90us cudaStreamAddCallback
0.00% 1.0844ms 4 271.09us 209.25us 420.96us cuDeviceTotalMem
0.00% 965.79us 400 2.4140us 480ns 353.22us cudaCreateChannelDesc
0.00% 821.02us 160 5.1310us 1.1520us 326.69us cudaDeviceGetAttribute
0.00% 405.82us 4 101.46us 12.704us 341.63us cudaHostGetDevicePointer
0.00% 340.29us 1 340.29us 340.29us 340.29us cuGraphicsUnregisterResource
0.00% 331.26us 6 55.210us 960ns 326.02us cudaRuntimeGetVersion
0.00% 322.11us 4 80.528us 1.2480us 317.73us cudaGetDeviceCount
0.00% 287.07us 1 287.07us 287.07us 287.07us cudaMemGetInfo
0.00% 257.02us 1 257.02us 257.02us 257.02us cudaEGLStreamProducerDisconnect
0.00% 24.416us 6 4.0690us 1.2800us 9.0240us cuDeviceGetCount
0.00% 19.968us 4 4.9920us 1.3760us 11.104us cuDeviceGetUuid
0.00% 17.824us 4 4.4560us 2.4000us 6.4320us cuDeviceGetName
0.00% 17.536us 4 4.3840us 3.7120us 5.7600us cudaDeviceGetStreamPriorityRange
0.00% 17.120us 3 5.7060us 5.2160us 6.1760us cuDriverGetVersion
0.00% 12.480us 3 4.1600us 3.2640us 4.7680us cuInit
0.00% 12.352us 5 2.4700us 1.2160us 4.8640us cuDeviceGet
==2310233== NVTX result:
==2310233== Thread "<unnamed>" (id = 1)
==2310233== Domain "TensorRT"
==2310233== Range "ExecutionContext::enqueue"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 715.94ms 400 1.7899ms 1.3825ms 16.730ms ExecutionContext::enqueue
GPU activities: 35.14% 2.3695ms 400 5.9230us 4.3520us 8.6080us void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
32.44% 2.1872ms 400 5.4680us 4.3840us 8.0960us void genericReformat::copyPackedKernel<float, char, bool=1, bool=0, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const *, void*, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
32.42% 2.1855ms 400 5.4630us 3.7760us 8.5440us void genericReformat::copyPackedKernel<char, float, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
API calls: 100.00% 108.87ms 1200 90.723us 36.576us 180.93us cudaLaunchKernel
==2310233== Range "data copy finish"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 20.777ms 400 51.942us 30.848us 104.54us data copy finish
No kernels were profiled in this range.
No API activities were profiled in this range.
==2310233== Range "data to nvm"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 58.441ms 400 146.10us 63.360us 245.12us data to nvm
GPU activities: 100.00% 2.1872ms 400 5.4680us 4.3840us 8.0960us void genericReformat::copyPackedKernel<float, char, bool=1, bool=0, genericReformat::ArrayN<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=4>>, void const *, int, int, int, float const *, void*, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=4)
API calls: 100.00% 39.559ms 400 98.897us 42.912us 180.93us cudaLaunchKernel
==2310233== Range "prob"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 48.300ms 400 120.75us 52.960us 1.8591ms prob
GPU activities: 100.00% 2.3695ms 400 5.9230us 4.3520us 8.6080us void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
API calls: 100.00% 32.333ms 400 80.833us 36.576us 135.17us cudaLaunchKernel
==2310233== Range "{conv1,pool1,conv2,pool2,ip1,relu1,ip2}"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 445.11ms 400 1.1128ms 865.89us 11.927ms {conv1,pool1,conv2,pool2,ip1,relu1,ip2}
No kernels were profiled in this range.
No API activities were profiled in this range.
==2310233== Range "{conv1,pool1,conv2,pool2,ip1,relu1,ip2} output reformatter 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 100.04ms 400 250.11us 132.61us 2.4233ms {conv1,pool1,conv2,pool2,ip1,relu1,ip2} output reformatter 0
GPU activities: 100.00% 2.1855ms 400 5.4630us 3.7760us 8.5440us void genericReformat::copyPackedKernel<char, float, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
API calls: 100.00% 36.975ms 400 92.438us 43.968us 168.58us cudaLaunchKernel
==2310233== Range "{conv1,pool1,conv2,pool2,ip1,relu1,ip2} output to be reformatted 0 finish"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 24.482ms 400 61.204us 32.608us 213.57us {conv1,pool1,conv2,pool2,ip1,relu1,ip2} output to be reformatted 0 finish
No kernels were profiled in this range.
No API activities were profiled in this range.
GPU result:
==2330713== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 27.06% 20.411ms 402 50.774us 28.800us 54.177us trt_volta_fp32_icudnn_int8x4_128x32_relu_interior_nn_v1
17.07% 12.877ms 402 32.031us 30.528us 35.232us void gemmSN_NN_kernel<float, int=256, int=4, int=2, int=8, int=4, int=4, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
14.24% 10.739ms 402 26.714us 22.912us 39.456us trt_volta_int8x4_icudnn_int8x4_128x32_relu_small_nn_v1
7.15% 5.3972ms 404 13.359us 5.7280us 28.928us void gemmSN_TN_kernel<float, int=128, int=16, int=2, int=4, int=4, int=4, bool=1, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
3.55% 2.6801ms 1048 2.5570us 320ns 160.23us [CUDA memcpy HtoD]
3.55% 2.6772ms 400 6.6920us 5.8240us 9.1520us void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
3.53% 2.6618ms 402 6.6210us 5.8880us 8.0640us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
3.13% 2.3620ms 814 2.9010us 1.7280us 7.5840us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
2.83% 2.1353ms 402 5.3110us 4.7040us 7.2960us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
1.94% 1.4667ms 405 3.6210us 3.1680us 6.3360us cuInt8::nchwToNcqhw4(float const *, unsigned int*, int, int, int, int, int, int, int, float const *, cuInt8::ReducedDivisorParameters)
1.77% 1.3321ms 404 3.2970us 2.7200us 4.4800us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=1, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
1.29% 976.80us 187 5.2230us 1.2800us 82.145us [CUDA memcpy DtoD]
0.44% 334.30us 138 2.4220us 1.4400us 4.5120us cask::computeOffsetsKernel(cask::ComputeOffsetsParams)
0.43% 323.84us 2 161.92us 129.70us 194.15us trt_volta_sgemm_128x128_relu_nn_v1
0.38% 287.71us 401 717ns 416ns 64.705us [CUDA memcpy DtoH]
0.27% 204.99us 2 102.50us 85.440us 119.55us trt_volta_sgemm_64x64_relu_nn_v1
0.26% 198.15us 2 99.073us 77.505us 120.64us trt_volta_sgemm_128x64_relu_nn_v1
0.26% 197.63us 2 98.816us 61.344us 136.29us trt_volta_scudnn_128x128_relu_xregs_large_nn_v1
0.24% 183.78us 2 91.888us 48.928us 134.85us trt_volta_scudnn_128x128_relu_medium_nn_v1
0.24% 183.71us 2 91.856us 49.344us 134.37us trt_volta_scudnn_128x128_relu_interior_nn_v1
0.24% 183.26us 2 91.632us 47.168us 136.10us trt_volta_scudnn_128x128_relu_small_nn_v1
0.24% 182.56us 308 592ns 288ns 1.6320us [CUDA memset]
0.19% 145.31us 96 1.5130us 1.2480us 2.8480us cask::reorderImma8816Bias(cask::ReorderImma8816BiasParams)
0.19% 140.67us 2 70.336us 39.136us 101.54us trt_volta_sgemm_32x128_relu_nn_v1
0.17% 129.63us 2 64.817us 49.345us 80.289us trt_volta_int8_i8816cudnn_int8_256x128_ldg16_relu_large_nt_v1
0.17% 128.70us 2 64.352us 33.792us 94.912us trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
0.17% 128.06us 2 64.032us 38.112us 89.952us trt_volta_sgemm_128x32_relu_nn_v1
0.17% 126.56us 2 63.280us 33.312us 93.248us trt_volta_scudnn_128x64_relu_medium_nn_v1
0.17% 125.54us 2 62.768us 30.784us 94.752us trt_volta_scudnn_128x64_relu_interior_nn_v1
0.16% 123.58us 2 61.792us 34.272us 89.312us trt_volta_scudnn_128x64_relu_small_nn_v1
0.16% 122.31us 2 61.152us 45.665us 76.640us trt_volta_int8_i8816cudnn_int8_256x128_ldg16_relu_medium_nt_v1
0.16% 120.03us 2 60.016us 45.312us 74.720us trt_volta_int8_i8816cudnn_int8_256x128_ldg16_relu_small_nt_v1
0.16% 119.36us 2 59.680us 44.192us 75.168us trt_volta_int8_i8816cudnn_int8_256x128_ldg16_relu_interior_nt_v1
0.16% 119.17us 2 59.584us 26.144us 93.025us trt_volta_scudnn_128x32_relu_small_nn_v1
0.16% 118.31us 2 59.152us 56.033us 62.272us trt_volta_fp32_icudnn_int8x4_128x128_relu_xregs_large_nn_v1
0.15% 116.67us 2 58.336us 26.112us 90.560us trt_volta_scudnn_128x32_relu_interior_nn_v1
0.15% 116.03us 2 58.016us 30.208us 85.824us volta_scudnn_128x64_relu_interior_nn_v1
0.15% 115.87us 2 57.936us 43.296us 72.576us trt_volta_int8_i8816cudnn_int8_128x128_ldg16_relu_large_nt_v1
0.15% 110.88us 32 3.4650us 2.2400us 5.5360us cask::reorderImma8816Filter(cask::ReorderImma8816FilterParams)
0.15% 110.08us 2 55.040us 53.792us 56.288us trt_volta_int8x4_icudnn_int8x4_128x128_relu_xregs_large_nn_v1
0.14% 107.71us 2 53.856us 51.840us 55.873us trt_volta_int8x4_icudnn_int8x4_128x128_relu_xregs_large_c32_nn_v1
0.14% 107.52us 2 53.760us 48.928us 58.592us trt_volta_fp32_icudnn_int8x4_128x128_relu_medium_nn_v1
0.14% 104.00us 2 52.000us 41.248us 62.753us trt_volta_int8_i8816cudnn_int8_128x128_ldg16_relu_small_nt_v1
0.14% 101.98us 2 50.992us 49.120us 52.864us trt_volta_fp32_icudnn_int8x4_128x128_relu_interior_nn_v1
0.13% 101.44us 2 50.720us 47.841us 53.600us trt_volta_fp32_icudnn_int8x4_128x128_relu_small_nn_v1
0.13% 98.144us 2 49.072us 35.488us 62.656us trt_volta_int8_i8816cudnn_int8_128x128_ldg16_relu_interior_nt_v1
0.13% 98.112us 2 49.056us 35.616us 62.496us trt_volta_int8_i8816cudnn_int8_128x128_ldg16_relu_medium_nt_v1
0.13% 97.344us 2 48.672us 47.872us 49.472us trt_volta_int8x4_icudnn_int8x4_128x128_relu_small_c32_nn_v1
0.13% 96.992us 2 48.496us 43.232us 53.760us trt_volta_int8x4_icudnn_int8x4_128x128_relu_small_nn_v1
0.13% 96.640us 2 48.320us 44.224us 52.416us trt_volta_int8x4_icudnn_int8x4_128x128_relu_interior_c32_nn_v1
0.13% 96.352us 2 48.176us 45.184us 51.168us trt_volta_int8x4_icudnn_int8x4_128x128_relu_medium_nn_v1
0.13% 96.128us 2 48.064us 44.480us 51.648us trt_volta_int8x4_icudnn_int8x4_128x128_relu_medium_c32_nn_v1
0.13% 94.304us 2 47.152us 44.256us 50.048us trt_volta_int8x4_icudnn_int8x4_128x128_relu_interior_nn_v1
0.12% 87.137us 2 43.568us 38.144us 48.993us trt_volta_fp32_icudnn_int8x4_128x64_relu_medium_nn_v1
0.11% 86.753us 2 43.376us 36.033us 50.720us trt_volta_fp32_icudnn_int8x4_128x32_relu_xregs_small_nn_v1
0.11% 86.496us 2 43.248us 21.248us 65.248us void cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>(int, int, int, float const *, int, float const , int, cudnn::detail::explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=1>*, kernel_conv_params, int, int, float, float, int, float const *, float const *)
0.11% 84.416us 2 42.208us 40.192us 44.224us trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_large_nt_v1
0.11% 84.384us 2 42.192us 34.848us 49.536us trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_medium_c32_nn_v1
0.11% 84.192us 2 42.096us 39.808us 44.384us trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_large_nt_v1
0.11% 83.776us 2 41.888us 35.264us 48.512us trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_small_nn_v1
0.11% 83.392us 2 41.696us 35.456us 47.936us trt_volta_fp32_icudnn_int8x4_128x32_relu_xregs_medium_nn_v1
0.11% 83.137us 2 41.568us 40.449us 42.688us trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_medium_nt_v1
0.11% 82.304us 2 41.152us 35.552us 46.752us trt_volta_fp32_icudnn_int8x4_128x32_relu_xregs_interior_nn_v1
0.11% 82.176us 2 41.088us 36.992us 45.184us trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_small_c32_nn_v1
0.11% 82.144us 2 41.072us 32.000us 50.144us trt_volta_int8x4_icudnn_int8x4_128x64_relu_medium_nn_v1
0.11% 81.920us 2 40.960us 40.192us 41.728us trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_medium_nt_v1
0.11% 81.472us 2 40.736us 38.880us 42.592us trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_small_nt_v1
0.11% 80.928us 2 40.464us 33.344us 47.584us trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_medium_nn_v1
0.11% 80.064us 2 40.032us 38.656us 41.408us trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_small_nt_v1
0.11% 79.456us 2 39.728us 32.992us 46.464us trt_volta_fp32_icudnn_int8x4_128x64_relu_xregs_large_nn_v1
0.11% 79.265us 2 39.632us 39.233us 40.032us trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_interior_nt_v1
0.10% 78.816us 2 39.408us 34.720us 44.096us trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_interior_c32_nn_v1
0.10% 78.176us 2 39.088us 33.280us 44.896us trt_volta_int8x4_icudnn_int8x4_128x64_relu_medium_c32_nn_v1
0.10% 77.920us 2 38.960us 32.768us 45.152us trt_volta_fp32_icudnn_int8x4_128x64_relu_small_nn_v1
0.10% 75.968us 2 37.984us 36.480us 39.488us trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_interior_nt_v1
0.10% 75.712us 2 37.856us 33.152us 42.560us trt_volta_fp32_icudnn_int8x4_128x64_relu_interior_nn_v1
0.10% 75.712us 2 37.856us 34.176us 41.536us trt_volta_int8x4_icudnn_int8x4_128x64_relu_xregs_large_nn_v1
0.10% 75.200us 2 37.600us 35.488us 39.712us trt_volta_int8x4_icudnn_int8x4_128x64_relu_interior_nn_v1
0.10% 74.112us 2 37.056us 9.8560us 64.256us void im2col4d_kernel<float, int>(im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const *, float*, int)
0.10% 73.761us 2 36.880us 31.905us 41.856us trt_volta_int8x4_icudnn_int8x4_128x64_relu_xregs_large_c32_nn_v1
0.10% 73.312us 2 36.656us 31.424us 41.888us trt_volta_int8x4_icudnn_int8x4_128x64_relu_small_nn_v1
0.10% 73.282us 2 36.641us 27.137us 46.145us trt_volta_int8x4_icudnn_int8x4_128x32_relu_medium_nn_v1
0.10% 72.384us 2 36.192us 31.104us 41.280us trt_volta_int8x4_icudnn_int8x4_128x32_relu_xregs_interior_nn_v1
0.09% 71.616us 2 35.808us 28.320us 43.296us trt_volta_fp32_icudnn_int8x4_128x32_relu_medium_nn_v1
0.09% 71.200us 2 35.600us 27.584us 43.616us trt_volta_fp32_icudnn_int8x4_128x32_relu_small_nn_v1
0.09% 70.433us 2 35.216us 29.952us 40.481us trt_volta_int8x4_icudnn_int8x4_128x64_relu_small_c32_nn_v1
0.09% 68.993us 2 34.496us 30.144us 38.849us trt_volta_int8x4_icudnn_int8x4_128x64_relu_interior_c32_nn_v1
0.09% 68.640us 2 34.320us 25.728us 42.912us trt_volta_int8x4_icudnn_int8x4_128x32_relu_medium_c32_nn_v1
0.09% 65.888us 2 32.944us 28.480us 37.408us trt_volta_int8x4_icudnn_int8x4_128x32_relu_small_c32_nn_v1
0.09% 65.824us 1 65.824us 65.824us 65.824us trt_volta_int8_igemm_int8_128x128_ldg4_relu_nn_v0
0.09% 65.728us 2 32.864us 26.848us 38.880us trt_volta_int8x4_icudnn_int8x4_128x32_relu_interior_c32_nn_v1
0.08% 63.456us 2 31.728us 26.496us 36.960us trt_volta_int8x4_icudnn_int8x4_128x32_relu_interior_nn_v1
0.08% 59.072us 1 59.072us 59.072us 59.072us void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.07% 54.208us 4 13.552us 8.8960us 23.552us void genericReformat::copyPackedKernel<float, char, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
0.05% 39.777us 4 9.9440us 8.0010us 14.336us void genericReformat::copyPackedKernel<char, float, bool=1, bool=0, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const *, void*, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
0.05% 37.248us 2 18.624us 15.328us 21.920us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=8, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.05% 34.080us 2 17.040us 16.032us 18.048us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=7, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.04% 32.864us 2 16.432us 13.280us 19.584us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=7, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.04% 32.224us 2 16.112us 14.880us 17.344us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=8, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.04% 29.312us 5 5.8620us 4.1600us 10.720us cuInt8::ncqhw4ToNchw(char const *, float*, int, int, int, int, int, int, float const *, cuInt8::ReducedDivisorParameters)
0.04% 27.776us 2 13.888us 13.632us 14.144us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=6, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.04% 27.616us 2 13.808us 10.912us 16.704us void gemmSN_NN_kernel<float, int=128, int=2, int=4, int=8, int=4, int=4, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
0.04% 26.656us 2 13.328us 12.928us 13.728us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=5, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.03% 24.896us 2 12.448us 11.616us 13.280us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=6, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.03% 24.833us 2 12.416us 11.488us 13.345us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.03% 24.384us 1 24.384us 24.384us 24.384us void cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const *, int, float*, cudnn::detail::implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>*, kernel_conv_params, int, float, float, int, float, float, int, int)
0.03% 23.808us 2 11.904us 11.392us 12.416us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=4, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.03% 21.920us 2 10.960us 10.176us 11.744us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=1, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.03% 21.888us 2 10.944us 10.592us 11.296us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=4, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.03% 21.344us 2 10.672us 10.016us 11.328us void cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>(cudnnTensorStruct, float const *, cudnn::detail::pooling_fw_4d_kernel<float, float, cudnn::detail::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>, cudnnTensorStruct*, cudnnPoolingStruct, float, cudnnPoolingStruct, int, cudnn::reduced_divisor, float)
0.03% 19.296us 4 4.8240us 4.0960us 6.0160us cuInt8::nc32hw32ToNcqhw4(char4 const *, char4*, nvinfer1::rt::reduced_divisor, int, nvinfer1::rt::reduced_divisor, nvinfer1::rt::reduced_divisor, int, int)
0.02% 18.848us 2 9.4240us 8.9280us 9.9200us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=3, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 17.920us 2 8.9600us 8.3200us 9.6000us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=3, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 17.024us 2 8.5120us 8.2240us 8.8000us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=6, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 16.064us 2 8.0320us 7.3920us 8.6720us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=3, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 15.808us 2 7.9040us 7.8720us 7.9360us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=8, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 15.617us 2 7.8080us 7.5210us 8.0960us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=2, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 15.488us 2 7.7440us 7.7440us 7.7440us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=2, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 15.264us 2 7.6320us 7.2320us 8.0320us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=8, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 14.912us 2 7.4560us 6.5920us 8.3200us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=1, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 14.816us 2 7.4080us 6.4960us 8.3200us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 14.752us 2 7.3760us 7.2320us 7.5200us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=3, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 14.688us 2 7.3440us 6.3680us 8.3200us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=7, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 14.560us 2 7.2800us 7.1360us 7.4240us void nvinfer1::poolCHWPackedInt8Quad<nvinfer1::PoolingType>(char4 const *, nvinfer1::poolCHWPackedInt8Quad<nvinfer1::PoolingType>*, int, int, int, int, int, nvinfer1::rt::reduced_divisor, nvinfer1::rt, int, int, int, int, int, int, float, float4 const *, float4 const , int, int)
0.02% 14.432us 2 7.2160us 6.5280us 7.9040us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=6, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 14.432us 2 7.2160us 6.7200us 7.7120us void nvinfer1::poolNCxHWxInt8<nvinfer1::PoolingType>(nvinfer1::IMMAInt8PackedArray const *, nvinfer1::poolNCxHWxInt8<nvinfer1::PoolingType>*, int, int, nvinfer1::rt::reduced_divisor, int, int, int, nvinfer1::rt, int, int, int, int, int, int, nvinfer1::rt, float, float, nvinfer1::IMMAFloatPackedArray const *, nvinfer1::IMMAFloatPackedArray const , int, int)
0.02% 14.304us 2 7.1520us 6.0480us 8.2560us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=7, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 14.240us 2 7.1200us 7.1040us 7.1360us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=4, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 13.888us 2 6.9440us 6.6560us 7.2320us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=1, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 13.632us 3 4.5440us 3.9680us 5.4720us cuInt8::ncqhw4ToNc32hw32(char4 const *, char4*, nvinfer1::rt::reduced_divisor, int, nvinfer1::rt::reduced_divisor, nvinfer1::rt::reduced_divisor, int, int)
0.02% 13.312us 2 6.6560us 6.3040us 7.0080us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=1, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.01% 11.200us 2 5.6000us 5.4720us 5.7280us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=4, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.01% 10.784us 2 5.3920us 5.1200us 5.6640us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.01% 6.4640us 2 3.2320us 3.1680us 3.2960us cudnn::gemm::computeOffsetsKernel(cudnn::gemm::ComputeOffsetsParams)
0.01% 4.4160us 1 4.4160us 4.4160us 4.4160us cuInt8::ncqhw4ToNchw(char const *, char*, int, int, int, int, int, int, cuInt8::ReducedDivisorParameters)
API calls: 44.34% 30.8231s 1397 22.064ms 10.560us 748.31ms cudaMalloc
25.77% 17.9166s 1429 12.538ms 3.0080us 2.66518s cudaFree
19.50% 13.5546s 32 423.58ms 4.6720us 13.5541s cudaStreamCreateWithFlags
8.11% 5.63520s 2 2.81760s 21.280us 5.63518s cudaEventCreate
1.48% 1.03069s 656 1.5712ms 4.0000us 15.928ms cudaEventSynchronize
0.32% 223.54ms 4945 45.205us 21.440us 966.66us cudaLaunchKernel
0.09% 60.993ms 828 73.662us 14.368us 1.5800ms cudaMemcpyAsync
0.07% 49.828ms 808 61.668us 22.944us 1.6152ms cudaMemcpy
0.05% 37.075ms 256 144.82us 73.120us 874.43us cudaStreamAddCallback
0.05% 36.099ms 308 117.21us 18.656us 1.4974ms cudaMemsetAsync
0.04% 26.018ms 1716 15.161us 2.4320us 520.42us cudaEventRecord
0.02% 12.971ms 1218 10.649us 7.7440us 55.968us cudaFuncGetAttributes
0.02% 12.971ms 11 1.1791ms 5.7600us 12.812ms cudaStreamSynchronize
0.02% 10.639ms 4 2.6598ms 2.3662ms 2.9057ms cudaHostAlloc
0.01% 8.8367ms 993 8.8980us 3.0720us 85.376us cudaEventDestroy
0.01% 8.4672ms 993 8.5260us 3.2320us 371.97us cudaEventCreateWithFlags
0.01% 7.9076ms 5032 1.5710us 992ns 506.88us cudaDeviceGetAttribute
0.01% 6.4732ms 449 14.416us 4.3840us 507.62us cudaStreamDestroy
0.01% 6.0601ms 401 15.112us 8.2560us 102.14us cudaStreamCreate
0.01% 6.0570ms 5322 1.1380us 416ns 407.30us cudaGetLastError
0.01% 5.5872ms 1218 4.5870us 2.5280us 417.86us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.01% 5.2562ms 15 350.41us 226.34us 625.79us cudaGetDeviceProperties
0.01% 5.0499ms 657 7.6860us 2.8480us 52.256us cudaEventElapsedTime
0.01% 4.6349ms 4 1.1587ms 1.0953ms 1.2072ms cudaFreeHost
0.01% 4.5369ms 720 6.3010us 1.8880us 784.00us cudaFuncSetAttribute
0.01% 4.3624ms 1247 3.4980us 1.4400us 534.59us cudaGetDevice
0.00% 3.4085ms 16 213.03us 4.9280us 3.1916ms cudaStreamCreateWithPriority
0.00% 3.1423ms 379 8.2910us 704ns 852.16us cuDeviceGetAttribute
0.00% 1.6859ms 6 280.98us 13.952us 1.5688ms cudaBindTexture
0.00% 1.1122ms 4 278.06us 227.52us 403.87us cuDeviceTotalMem
0.00% 770.21us 24 32.092us 4.2560us 501.22us cudaDeviceSynchronize
0.00% 607.04us 4 151.76us 9.6640us 542.98us cudaHostGetDevicePointer
0.00% 406.18us 12 33.848us 864ns 317.89us cudaRuntimeGetVersion
0.00% 309.06us 4 77.264us 1.1840us 305.34us cudaGetDeviceCount
0.00% 304.38us 1 304.38us 304.38us 304.38us cudaMemGetInfo
0.00% 90.912us 6 15.152us 2.6560us 38.688us cudaUnbindTexture
0.00% 40.000us 6 6.6660us 1.3440us 23.040us cuDeviceGetCount
0.00% 38.272us 4 9.5680us 3.2640us 27.424us cudaDeviceGetStreamPriorityRange
0.00% 18.752us 5 3.7500us 1.1840us 11.584us cuDeviceGet
0.00% 17.312us 3 5.7700us 3.4560us 9.4080us cuDriverGetVersion
0.00% 13.856us 3 4.6180us 3.9680us 5.4720us cuInit
0.00% 12.192us 4 3.0480us 2.3680us 4.5440us cuDeviceGetName
0.00% 10.240us 4 2.5600us 1.7600us 4.4160us cuDeviceGetUuid
==2330713== NVTX result:
==2330713== Thread "<unnamed>" (id = 1)
==2330713== Domain "TensorRT"
==2330713== Range "ExecutionContext::enqueue"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 277.07ms 400 692.67us 444.99us 2.6562ms ExecutionContext::enqueue
GPU activities: 33.00% 20.342ms 400 50.855us 49.056us 54.177us trt_volta_fp32_icudnn_int8x4_128x32_relu_interior_nn_v1
20.78% 12.808ms 400 32.019us 30.528us 35.232us void gemmSN_NN_kernel<float, int=256, int=4, int=2, int=8, int=4, int=4, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
17.31% 10.673ms 400 26.683us 22.912us 32.192us trt_volta_int8x4_icudnn_int8x4_128x32_relu_small_nn_v1
8.63% 5.3177ms 400 13.294us 12.096us 15.456us void gemmSN_TN_kernel<float, int=128, int=16, int=2, int=4, int=4, int=4, bool=1, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
4.34% 2.6772ms 400 6.6920us 5.8240us 9.1520us void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
4.29% 2.6475ms 400 6.6180us 5.8880us 8.0640us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
3.73% 2.2989ms 800 2.8730us 1.9840us 5.3760us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
3.45% 2.1248ms 400 5.3120us 4.7040us 7.2960us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
2.34% 1.4401ms 400 3.6000us 3.1680us 5.1840us cuInt8::nchwToNcqhw4(float const *, unsigned int*, int, int, int, int, int, int, int, float const *, cuInt8::ReducedDivisorParameters)
2.14% 1.3175ms 400 3.2930us 3.1040us 4.4800us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=1, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
API calls: 100.00% 174.32ms 4400 39.618us 21.440us 203.87us cudaLaunchKernel
==2330713== Range "conv1"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 23.534ms 400 58.835us 32.928us 276.90us conv1
GPU activities: 100.00% 10.673ms 400 26.683us 22.912us 32.192us trt_volta_int8x4_icudnn_int8x4_128x32_relu_small_nn_v1
API calls: 100.00% 17.002ms 400 42.506us 23.584us 151.58us cudaLaunchKernel
==2330713== Range "conv1 input reformatter 0"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 28.516ms 400 71.291us 40.576us 253.15us conv1 input reformatter 0
GPU activities: 100.00% 1.4401ms 400 3.6000us 3.1680us 5.1840us cuInt8::nchwToNcqhw4(float const *, unsigned int*, int, int, int, int, int, int, int, float const *, cuInt8::ReducedDivisorParameters)
API calls: 100.00% 20.209ms 400 50.522us 29.184us 178.69us cudaLaunchKernel
==2330713== Range "conv2"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 21.559ms 400 53.896us 32.480us 192.19us conv2
GPU activities: 100.00% 20.342ms 400 50.855us 49.056us 54.177us trt_volta_fp32_icudnn_int8x4_128x32_relu_interior_nn_v1
API calls: 100.00% 16.013ms 400 40.032us 23.616us 181.41us cudaLaunchKernel
==2330713== Range "ip1 + relu1"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 83.385ms 400 208.46us 139.17us 536.70us ip1 + relu1
GPU activities: 82.45% 12.808ms 400 32.019us 30.528us 35.232us void gemmSN_NN_kernel<float, int=256, int=4, int=2, int=8, int=4, int=4, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
9.07% 1.4093ms 400 3.5230us 2.9120us 5.3760us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
8.48% 1.3175ms 400 3.2930us 3.1040us 4.4800us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=1, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
API calls: 100.00% 46.532ms 1200 38.777us 22.880us 171.90us cudaLaunchKernel
==2330713== Range "ip2"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 45.353ms 400 113.38us 81.280us 299.84us ip2
GPU activities: 85.67% 5.3177ms 400 13.294us 12.096us 15.456us void gemmSN_TN_kernel<float, int=128, int=16, int=2, int=4, int=4, int=4, bool=1, cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>>(cublasGemmSmallNParams<float const , cublasGemvTensorStridedBatched<float const >, float>)
14.33% 889.60us 400 2.2240us 1.9840us 3.3920us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray, bool)
API calls: 100.00% 28.028ms 800 35.034us 21.440us 191.42us cudaLaunchKernel
==2330713== Range "pool1"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 21.144ms 400 52.860us 30.528us 168.48us pool1
GPU activities: 100.00% 2.6475ms 400 6.6180us 5.8880us 8.0640us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=4, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
API calls: 100.00% 16.329ms 400 40.822us 22.944us 128.03us cudaLaunchKernel
==2330713== Range "pool2"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 19.760ms 400 49.400us 30.112us 216.64us pool2
GPU activities: 100.00% 2.1248ms 400 5.3120us 4.7040us 7.2960us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
API calls: 100.00% 15.744ms 400 39.360us 22.944us 203.87us cudaLaunchKernel
==2330713== Range "prob"
Type Time(%) Time Calls Avg Min Max Name
Range: 100.00% 20.761ms 400 51.902us 31.104us 1.6653ms prob
GPU activities: 100.00% 2.6772ms 400 6.6920us 5.8240us 9.1520us void cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>(cudnnTensorStruct, float const *, cudnn::detail::softmax_fw_kernel<int=2, float, float, int=256, int=1, int=1, int=0>, cudnnTensorStruct*, int, float, cudnnTensorStruct*, int, int)
API calls: 100.00% 14.465ms 400 36.162us 22.528us 140.45us cudaLaunchKernel
Others have asked similar questions on the forum, for example:
https://devtalk.nvidia.com/default/topic/1056139/general/-after-using-dla-the-speed-is-slower/
https://devtalk.nvidia.com/default/topic/1047422/general/decreased-performance-from-fp16-to-int8-in-tf-trt-on-jetson-xavier/post/5315733/#5315733
https://devtalk.nvidia.com/default/topic/1051679/jetson-agx-xavier/tensorrt-log-info-about-nvdla-/post/5338061/#5338061
The posts above also report that DLA is slower than the GPU, but I can't find a clear solution or explanation in them.
Here are my questions:
- Is DLA really slower than the GPU? If so, what is DLA actually useful for?
- The build log shows that DLA fuses the neural network layers, e.g. “{conv1,pool1,conv2,pool2,ip1,relu1,ip2}”. How does this fusion work, and can the fused block be split?
- How can I analyze or monitor DLA activity while the program is running on it? I can't see it through nvprof.
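For reference, this is roughly how I compared the two modes with trtexec (a sketch: the model/output names are placeholders from the MNIST sample, but `--useDLACore`, `--allowGPUFallback`, `--int8` and `--fp16` are standard trtexec flags; DLA only supports FP16/INT8 precision):

```shell
# GPU baseline, INT8
./trtexec --deploy=mnist.prototxt --output=prob --int8

# DLA core 0, with fallback to GPU for unsupported layers
./trtexec --deploy=mnist.prototxt --output=prob --int8 \
          --useDLACore=0 --allowGPUFallback
```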
Thanks.