Hi,
We just checked nvprof inside the l4t-pytorch:r32.5.0-pth1.7-py3 container, and it works well in our environment. Could you please try it again? For reference, here is the exact command and the full nvprof output:
$ sudo docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-pytorch:r32.5.0-pth1.7-py3
root@nvidia-desktop:/# nvprof /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
[07/08/2021-07:27:11] [I] === Model Options ===
[07/08/2021-07:27:11] [I] Format: ONNX
[07/08/2021-07:27:11] [I] Model: /usr/src/tensorrt/data/mnist/mnist.onnx
[07/08/2021-07:27:11] [I] Output:
[07/08/2021-07:27:11] [I] === Build Options ===
[07/08/2021-07:27:11] [I] Max batch: 1
[07/08/2021-07:27:11] [I] Workspace: 16 MB
[07/08/2021-07:27:11] [I] minTiming: 1
[07/08/2021-07:27:11] [I] avgTiming: 8
[07/08/2021-07:27:11] [I] Precision: FP32
[07/08/2021-07:27:11] [I] Calibration:
[07/08/2021-07:27:11] [I] Safe mode: Disabled
[07/08/2021-07:27:11] [I] Save engine:
[07/08/2021-07:27:11] [I] Load engine:
[07/08/2021-07:27:11] [I] Builder Cache: Enabled
[07/08/2021-07:27:11] [I] NVTX verbosity: 0
[07/08/2021-07:27:11] [I] Inputs format: fp32:CHW
[07/08/2021-07:27:11] [I] Outputs format: fp32:CHW
[07/08/2021-07:27:11] [I] Input build shapes: model
[07/08/2021-07:27:11] [I] Input calibration shapes: model
[07/08/2021-07:27:11] [I] === System Options ===
[07/08/2021-07:27:11] [I] Device: 0
[07/08/2021-07:27:11] [I] DLACore:
[07/08/2021-07:27:11] [I] Plugins:
[07/08/2021-07:27:11] [I] === Inference Options ===
[07/08/2021-07:27:11] [I] Batch: 1
[07/08/2021-07:27:11] [I] Input inference shapes: model
[07/08/2021-07:27:11] [I] Iterations: 10
[07/08/2021-07:27:11] [I] Duration: 3s (+ 200ms warm up)
[07/08/2021-07:27:11] [I] Sleep time: 0ms
[07/08/2021-07:27:11] [I] Streams: 1
[07/08/2021-07:27:11] [I] ExposeDMA: Disabled
[07/08/2021-07:27:11] [I] Spin-wait: Disabled
[07/08/2021-07:27:11] [I] Multithreading: Disabled
[07/08/2021-07:27:11] [I] CUDA Graph: Disabled
[07/08/2021-07:27:11] [I] Skip inference: Disabled
[07/08/2021-07:27:11] [I] Inputs:
[07/08/2021-07:27:11] [I] === Reporting Options ===
[07/08/2021-07:27:11] [I] Verbose: Disabled
[07/08/2021-07:27:11] [I] Averages: 10 inferences
[07/08/2021-07:27:11] [I] Percentile: 99
[07/08/2021-07:27:11] [I] Dump output: Disabled
[07/08/2021-07:27:11] [I] Profile: Disabled
[07/08/2021-07:27:11] [I] Export timing to JSON file:
[07/08/2021-07:27:11] [I] Export output to JSON file:
[07/08/2021-07:27:11] [I] Export profile to JSON file:
[07/08/2021-07:27:11] [I]
==13== NVPROF is profiling process 13, command: /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
==13== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
----------------------------------------------------------------
Input filename: /usr/src/tensorrt/data/mnist/mnist.onnx
ONNX IR version: 0.0.3
Opset version: 8
Producer name: CNTK
Producer version: 2.5.1
Domain: ai.cntk
Model version: 1
Doc string:
----------------------------------------------------------------
[07/08/2021-07:27:13] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/08/2021-07:27:13] [I] [TRT]
[07/08/2021-07:27:13] [I] [TRT] --------------- Layers running on DLA:
[07/08/2021-07:27:13] [I] [TRT]
[07/08/2021-07:27:13] [I] [TRT] --------------- Layers running on GPU:
[07/08/2021-07:27:13] [I] [TRT] Convolution28 + ReLU32, Pooling66, Convolution110 + ReLU114, Pooling160, Times212_reshape0, (Unnamed Layer* 0) [Constant] + Times212_reshape1, Times212, (Unnamed Layer* 16) [Constant], Plus214,
[07/08/2021-07:27:19] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[07/08/2021-07:27:19] [I] Starting inference threads
[07/08/2021-07:27:22] [I] Warmup completed 704 queries over 200 ms
[07/08/2021-07:27:22] [I] Timing trace has 14249 queries over 3.00022 s
[07/08/2021-07:27:22] [I] Trace averages of 10 runs:
[07/08/2021-07:27:22] [I] Average on 10 runs - GPU latency: 0.173114 ms - Host latency: 0.214737 ms (end to end 0.231467 ms, enqueue 0.167886 ms)
...
[07/08/2021-07:27:22] [I] Average on 10 runs - GPU latency: 0.135132 ms - Host latency: 0.166333 ms (end to end 0.184473 ms, enqueue 0.130078 ms)
[07/08/2021-07:27:22] [I] Host Latency
[07/08/2021-07:27:22] [I] min: 0.147217 ms (end to end 0.160156 ms)
[07/08/2021-07:27:22] [I] max: 1.11505 ms (end to end 1.14252 ms)
[07/08/2021-07:27:22] [I] mean: 0.175534 ms (end to end 0.189888 ms)
[07/08/2021-07:27:22] [I] median: 0.169189 ms (end to end 0.18335 ms)
[07/08/2021-07:27:22] [I] percentile: 0.237671 ms at 99% (end to end 0.255768 ms at 99%)
[07/08/2021-07:27:22] [I] throughput: 4749.33 qps
[07/08/2021-07:27:22] [I] walltime: 3.00022 s
[07/08/2021-07:27:22] [I] Enqueue Time
[07/08/2021-07:27:22] [I] min: 0.115234 ms
[07/08/2021-07:27:22] [I] max: 1.05127 ms
[07/08/2021-07:27:22] [I] median: 0.132568 ms
[07/08/2021-07:27:22] [I] GPU Compute
[07/08/2021-07:27:22] [I] min: 0.119873 ms
[07/08/2021-07:27:22] [I] max: 1.05804 ms
[07/08/2021-07:27:22] [I] mean: 0.142931 ms
[07/08/2021-07:27:22] [I] median: 0.137512 ms
[07/08/2021-07:27:22] [I] percentile: 0.195297 ms at 99%
[07/08/2021-07:27:22] [I] total compute time: 2.03662 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
==13== Profiling application: /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
==13== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 25.68% 95.707ms 15113 6.3320us 5.9200us 19.009us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=4, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
24.22% 90.243ms 15033 6.0020us 5.8250us 10.784us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=5, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
10.43% 38.874ms 14953 2.5990us 2.5600us 6.0810us void gemv2N_kernel<int, int, float, float, float, int=128, int=32, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>, float>>(float const )
9.62% 35.856ms 15337 2.3370us 1.8240us 4.9920us void nvinfer1::tiled_pooling::poolCHW_PQT<int=3, int=3, int=1, int=1, int=1, int=1, int=192, int=1, int=1, bool=0, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
8.92% 33.248ms 14985 2.2180us 2.1440us 4.3200us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=6, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
6.16% 22.967ms 14953 1.5350us 1.5040us 3.4880us void cuEltwise::eltwise<cuEltwise::SimpleAlgo<float, float>, cuEltwise::Compute<nvinfer1::ElementWiseOperation>>(cuEltwise::LaunchParams)
2.39% 8.9193ms 15131 589ns 288ns 1.5360us [CUDA memcpy HtoD]
1.81% 6.7616ms 14953 452ns 352ns 833ns [CUDA memcpy DtoH]
0.71% 2.6287ms 152 17.293us 5.5050us 32.130us void cudnn::cnn::conv2d_grouped_direct_kernel<float, float, float, float, float, float, bool=1, bool=0, int=0, int=0, int=0>(cudnnTensorStruct, float const *, cudnnFilterStruct, float const *, cudnnConvolutionStruct, cudnn::cnn::conv2d_grouped_direct_kernel<float, float, float, float, float, float, bool=1, bool=0, int=0, int=0, int=0>, float*, float, float*, cudnn::reduced_divisor, float, float, float, float, int, cudnnConvolutionStruct const *, float const *, cudnnActivationStruct)
0.46% 1.6965ms 80 21.206us 9.0240us 41.667us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=5, int=3, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.45% 1.6695ms 56 29.812us 14.305us 54.115us trt_volta_scudnn_128x128_relu_xregs_large_nn_v1
0.45% 1.6605ms 56 29.651us 14.337us 53.476us trt_volta_scudnn_128x128_relu_medium_nn_v1
0.43% 1.5870ms 64 24.796us 11.841us 42.147us trt_volta_scudnn_128x64_relu_medium_nn_v1
0.41% 1.5414ms 56 27.525us 13.537us 51.332us trt_volta_scudnn_128x128_relu_small_nn_v1
0.41% 1.5291ms 160 9.5570us 6.4960us 17.025us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=5, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.41% 1.5287ms 64 23.886us 11.809us 41.315us trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
0.35% 1.3042ms 64 20.378us 10.144us 36.099us trt_volta_scudnn_128x32_relu_small_nn_v1
0.35% 1.2978ms 51 25.448us 352ns 210.35us [CUDA memset]
0.34% 1.2791ms 64 19.985us 10.785us 35.939us trt_volta_scudnn_128x64_relu_small_nn_v1
0.27% 997.04us 80 12.462us 6.8480us 23.074us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=5, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.26% 966.28us 40 24.156us 22.049us 32.418us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=5, int=3, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.25% 941.51us 80 11.768us 7.9360us 21.858us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=5, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.21% 799.61us 32 24.987us 19.969us 30.210us void gemv2N_kernel<int, int, float2, float2, float2, int=128, int=8, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensorStridedBatched<float2 const >, cublasGemvTensorStridedBatched<float2>, float2>>(float2 const )
0.21% 797.11us 80 9.9630us 9.5690us 13.761us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=5, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.21% 773.18us 80 9.6640us 6.9760us 17.057us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=5, int=2, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.20% 752.66us 80 9.4080us 6.7840us 15.105us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=4, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.20% 729.40us 48 15.195us 8.0000us 31.554us trt_volta_scudnn_128x32_relu_medium_nn_v1
0.19% 714.89us 40 17.872us 16.481us 24.066us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=5, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.19% 698.74us 80 8.7340us 8.3520us 11.425us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=5, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.18% 666.35us 80 8.3290us 6.6880us 13.409us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=4, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.17% 649.46us 184 3.5290us 3.1040us 4.8640us void op_generic_tensor_kernel<int=3, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, reducedDivisorArray, int)
0.17% 628.91us 80 7.8610us 6.2410us 11.584us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=2, int=4, int=1, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.17% 619.38us 184 3.3660us 3.0410us 4.0970us void op_generic_tensor_kernel<int=1, float, float, float, int=256, cudnnGenericOp_t=8, cudnnNanPropagation_t=1, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, reducedDivisorArray, int)
0.16% 603.50us 40 15.087us 14.529us 19.009us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=3, int=4, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.14% 519.94us 40 12.998us 12.609us 17.057us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=3, int=5, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.13% 470.46us 40 11.761us 10.720us 18.945us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=6, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.11% 405.50us 40 10.137us 9.7610us 15.938us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=5, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.11% 393.88us 40 9.8470us 9.5050us 11.841us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=3, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.10% 372.86us 40 9.3210us 9.0250us 11.745us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=4, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.10% 371.55us 40 9.2880us 8.9610us 14.529us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=6, int=4, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.10% 368.57us 40 9.2140us 8.7370us 11.904us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=4, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.10% 365.46us 40 9.1360us 8.8320us 10.657us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=3, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.10% 363.90us 40 9.0970us 8.7370us 11.969us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=6, int=5, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.10% 356.05us 40 8.9010us 8.6090us 11.712us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=6, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.09% 353.14us 40 8.8280us 8.4490us 11.392us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=5, int=3, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.09% 352.41us 40 8.8100us 7.9690us 13.281us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=5, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.09% 344.98us 40 8.6240us 8.2880us 10.337us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=4, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.09% 342.55us 40 8.5630us 8.2570us 12.417us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=5, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.09% 341.69us 40 8.5420us 8.2250us 13.761us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=5, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.09% 340.28us 40 8.5060us 8.2560us 13.537us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=4, int=2, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.09% 335.32us 40 8.3820us 7.9040us 11.393us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=5, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.09% 335.13us 40 8.3780us 7.9370us 12.833us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=5, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.09% 324.92us 40 8.1230us 7.7440us 11.873us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=6, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.08% 306.87us 40 7.6710us 7.3290us 10.369us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=6, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.08% 304.34us 32 9.5100us 4.8640us 15.553us void fft2d_r2c_32x32<float, bool=0, unsigned int=5, bool=1>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.08% 290.67us 40 7.2660us 6.6560us 10.656us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=5, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.07% 271.76us 40 6.7940us 6.4640us 8.2890us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=4, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.07% 267.54us 40 6.6880us 6.3690us 9.0560us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=4, int=4, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.07% 260.59us 16 16.287us 15.201us 18.049us void fft2d_r2c_32x32<float, bool=0, unsigned int=1, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.07% 255.28us 40 6.3820us 6.0160us 8.7370us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=4, int=8, int=5, int=5, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.06% 206.80us 16 12.925us 12.481us 13.217us void fft2d_c2r_32x32<float, bool=0, bool=0, unsigned int=1, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
0.05% 183.18us 64 2.8620us 2.4640us 8.3200us void pooling_fw_4d_kernel<float, float, cudnn::maxpooling_func<float, cudnnNanPropagation_t=0>, int=0, bool=0>(cudnnTensorStruct, float const *, cudnnTensorStruct, float*, cudnnPoolingStruct, float, float, int, cudnn::reduced_divisor, cudnn::reduced_divisor)
0.04% 136.97us 64 2.1400us 1.9520us 4.3530us void CUTENSOR_NAMESPACE::tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t<unsigned int=1, int=256, unsigned int=64, unsigned int=1, unsigned int=0, unsigned int=1, unsigned int=2, unsigned int=0, unsigned int=1, unsigned int=2>, float, float, float, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=0, int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=64 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=64 const **, cutensorOperator_t, void const *, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )
0.03% 93.960us 16 5.8720us 5.4720us 11.169us void fft2d_c2r_32x32<float, bool=0, bool=0, unsigned int=0, bool=0, bool=0>(float*, float2 const *, int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float*, float*, int2, int, int)
0.02% 87.590us 32 2.7370us 2.3360us 4.2570us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=7, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 83.044us 64 1.2970us 1.1200us 2.3370us [CUDA memcpy DtoD]
0.02% 82.275us 32 2.5710us 2.2720us 4.3200us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=7, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 79.400us 32 2.4810us 2.2720us 4.7370us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=8, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 78.151us 32 2.4420us 2.2410us 4.4170us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=8, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 77.703us 32 2.4280us 2.2400us 3.3280us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=1, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 77.158us 16 4.8220us 4.1600us 11.201us void fft2d_r2c_32x32<float, bool=0, unsigned int=0, bool=0>(float2*, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.02% 76.741us 32 2.3980us 2.2400us 4.2880us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 76.355us 32 2.3860us 2.2400us 3.8400us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=5, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 75.720us 32 2.3660us 1.9840us 3.8080us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=4, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 75.717us 32 2.3660us 2.1120us 3.8400us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=1, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 75.461us 32 2.3580us 1.9840us 3.4880us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=3, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 75.426us 32 2.3570us 2.1760us 3.5520us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=6, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 75.367us 32 2.3550us 2.0480us 3.3600us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=4, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 74.565us 32 2.3300us 1.9840us 3.5840us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=32, int=256, int=2, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 73.764us 32 2.3050us 1.9200us 3.3920us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=3, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.02% 71.654us 32 2.2390us 1.8250us 3.4560us void nvinfer1::tiled_pooling::poolCHW_PQT<int=2, int=2, int=2, int=2, int=2, int=16, int=128, int=2, int=1, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.01% 42.308us 16 2.6440us 2.0480us 5.7600us cask_trt::computeOffsetsKernel(cask_trt::ComputeOffsetsParams)
API calls: 26.28% 2.64881s 239 11.083ms 1.4400us 1.62312s cudaFree
24.83% 2.50279s 94078 26.603us 12.512us 1.05240s cudaLaunchKernel
15.86% 1.59868s 18833 84.887us 1.5040us 1.2079ms cudaEventSynchronize
12.42% 1.25163s 24 52.151ms 2.3040us 1.25152s cudaStreamCreateWithFlags
7.91% 796.93ms 176 4.5280ms 10.113us 62.489ms cuModuleUnload
4.37% 440.12ms 112755 3.9030us 1.2800us 726.21us cudaEventRecord
3.62% 364.81ms 30110 12.116us 9.0880us 88.420us cudaMemcpyAsync
1.76% 177.53ms 108551 1.6350us 1.0880us 962.42us cudaEventElapsedTime
1.16% 116.85ms 3880 30.116us 14.017us 349.11us cudaStreamAddCallback
0.70% 70.322ms 30451 2.3090us 1.5360us 51.011us cudaStreamWaitEvent
0.54% 53.944ms 112959 477ns 224ns 775.43us cudaGetLastError
0.37% 37.773ms 14953 2.5260us 1.7280us 45.026us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.06% 6.3658ms 214 29.746us 6.9120us 370.51us cudaMalloc
0.02% 1.6867ms 51 33.072us 7.6480us 82.916us cudaMemsetAsync
0.01% 1.1697ms 540 2.1660us 1.2160us 23.873us cudaFuncSetAttribute
0.01% 1.1666ms 32 36.454us 29.665us 60.611us cudaMemcpy2DAsync
0.01% 753.93us 57 13.226us 8.1280us 34.017us cudaCreateTextureObject
0.01% 744.84us 241 3.0900us 704ns 67.043us cudaDeviceGetAttribute
0.01% 697.48us 473 1.4740us 512ns 65.123us cuDeviceGetAttribute
0.01% 666.72us 113 5.9000us 4.4170us 23.905us cudaStreamSynchronize
0.01% 649.34us 3 216.45us 76.740us 366.48us cudaHostAlloc
0.01% 580.73us 204 2.8460us 1.0880us 48.642us cudaEventDestroy
0.01% 566.17us 172 3.2910us 1.1520us 32.994us cudaEventCreateWithFlags
0.01% 520.22us 5 104.04us 27.617us 221.20us cudaFreeHost
0.00% 472.25us 57 8.2850us 3.8400us 27.362us cudaDestroyTextureObject
0.00% 469.49us 11 42.681us 19.777us 90.308us cudaGetDeviceProperties
0.00% 429.43us 7 61.347us 1.1200us 96.741us cudaMemcpy
0.00% 274.51us 12 22.875us 2.2720us 243.40us cudaStreamCreateWithPriority
0.00% 235.72us 40 5.8930us 2.3680us 45.122us cudaStreamDestroy
0.00% 154.95us 24 6.4560us 2.9760us 14.113us cudaDeviceSynchronize
0.00% 148.20us 32 4.6310us 3.3920us 9.1840us cudaEventCreate
0.00% 129.80us 3 43.265us 8.3840us 66.435us cudaHostGetDevicePointer
0.00% 114.98us 23 4.9990us 1.0880us 18.209us cudaGetDevice
0.00% 91.109us 1 91.109us 91.109us 91.109us cudaLaunchHostFunc
0.00% 83.621us 5 16.724us 8.6410us 23.137us cuDeviceTotalMem
0.00% 73.828us 4 18.457us 5.5050us 45.474us cudaStreamCreate
0.00% 71.299us 2 35.649us 28.129us 43.170us cudaSetDevice
0.00% 70.308us 2 35.154us 27.361us 42.947us cudaMallocHost
0.00% 67.683us 57 1.1870us 384ns 22.562us cudaCreateChannelDesc
0.00% 33.505us 4 8.3760us 7.2000us 10.016us cuInit
0.00% 23.042us 4 5.7600us 2.7520us 11.233us cuDriverGetVersion
0.00% 17.985us 10 1.7980us 480ns 9.1200us cudaGetDeviceCount
0.00% 13.760us 7 1.9650us 1.2800us 3.2640us cuDeviceGetCount
0.00% 8.8330us 5 1.7660us 1.2800us 2.4640us cuDeviceGetName
0.00% 8.4480us 3 2.8160us 2.0800us 4.2560us cudaDeviceGetStreamPriorityRange
0.00% 7.8080us 6 1.3010us 928ns 1.9840us cuDeviceGet
0.00% 7.0720us 2 3.5360us 3.5200us 3.5520us cuDevicePrimaryCtxRelease
0.00% 4.9920us 5 998ns 832ns 1.3120us cuDeviceGetUuid
0.00% 4.9280us 7 704ns 512ns 960ns cudaRuntimeGetVersion
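If you want to inspect the results offline rather than reading the summary in the terminal, nvprof can also write the profile to a file with its -o (--output-profile) flag; the resulting file can be opened in NVIDIA Visual Profiler. A minimal sketch (the output filename trtexec_profile.nvvp is just an example):

```shell
# Run the same trtexec workload under nvprof, but save the timeline
# to a file instead of only printing the summary tables.
nvprof -o trtexec_profile.nvvp \
    /usr/src/tensorrt/bin/trtexec \
    --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx
```

Run this inside the same container as above (so nvprof and trtexec are on the PATH), then copy the .nvvp file off the device to view it.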
Thanks.