Application returned non-zero code 12

I have been running nvprof on my custom app and am getting the following error. The application works fine without the profiler.
Can you please help me with profiling?
nvprof ./build/deepstream-app

New Constructor Single
With tracker : Devices : 1
==18652== NVPROF is profiling process 18652, command: ./build/deepstream-app
Source Bun Number : 0
Source Bi Name : source-bin-00
Sourcebin err 1?
Sourcebin err 3?
Link E: stream-muxer with : queue0
Link E: queue0 with : tee0
No Config Found: Setting default value for
No Config Found: Setting default value for
Setting nvtiler
Added nvtiler
All Elements created
Obtained request pad src_0 from tee0 for branch i = 0 inferIndex = 0 .
Obtained static pad queue1 for branch i = 0.
tee pads linked
Link Passed : queue1 with : nvinfer0
Link Passed : nvinfer0 with : queue2
Link Passed : queue2 with : nvtracker0
Link Passed : nvtracker0 with : queue3
Link Passed : queue3 with : nvmultistreamtiler0
Link Passed : nvmultistreamtiler0 with : queue4
Link Passed : queue4 with : nvvideoconvert0
Link Passed : nvvideoconvert0 with : queue5
Link Passed : queue5 with : nvdsosd0
Link Passed : nvdsosd0 with : queue6
Link Passed : queue6 with : fpsdisplaysink0
PRobe 1
PRobe 2
Probe : 1
Now playing:
gstnvtracker: Loading low-level lib at …/custom_tracker_parser/build/libnvds_customtracker.so
gstnvtracker: Optional NvMOT_RemoveStreams not implemented
gstnvtracker: Batch processing is ON
gstnvtracker: Past frame output is OFF
NvMOT_ConfigFilePathcamera2:yolov5:custom_tracker
NvMOT_streams: 1
pConfigIn->miscConfig.maxObjPerStream: 0
camera2
yolov5
No Config Found: Setting default value for
No Config Found: Setting default value for
Successful Init call
0:00:00.338102522 18652 0x558c85599c00 INFO nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger: NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1715> [UID = 1]: Trying to create engine from model files

Input filename: /home/salini/Downloads/yolov5s.onnx
ONNX IR version: 0.0.6
Opset version: 12
Producer name: pytorch
Producer version: 1.8
Domain:
Model version: 0
Doc string:

WARNING: …/nvdsinfer/nvdsinfer_func_utils.cpp:36 [TRT]: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
WARNING: …/nvdsinfer/nvdsinfer_func_utils.cpp:36 [TRT]: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
INFO: …/nvdsinfer/nvdsinfer_func_utils.cpp:39 [TRT]: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
INFO: …/nvdsinfer/nvdsinfer_func_utils.cpp:39 [TRT]: Detected 1 inputs and 4 output network tensors.
0:00:19.922347228 18652 0x558c85599c00 INFO nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger: NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1748> [UID = 1]: serialize cuda engine to file: /home/salini/Downloads/yolov5s.onnx_b1_gpu0_fp32.engine successfully
INFO: …/nvdsinfer/nvdsinfer_model_builder.cpp:685 [Implicit Engine Info]: layers num: 2
0 INPUT kFLOAT images 3x640x640
1 OUTPUT kFLOAT output 25200x85

0:00:19.929843098 18652 0x558c85599c00 INFO nvinfer gstnvinfer_impl.cpp:313:notifyLoadModelStatus: [UID 1]: Load new model:…/config/deepstream_infer_config.txt sucessfully
Decodebin child added: source
Decodebin child added: decodebin0
Decodebin child added: qtdemux0
Decodebin child added: multiqueue0
Decodebin child added: h264parse0
Decodebin child added: capsfilter0
Decodebin child added: aacparse0
Missing definition of the OpenACC API routine/s in the OpenACC library linked to the application. To work around this issue either force the inclusion of all the OpenACC symbols in the binary or link the OpenACC library dynamically.
==18652== Profiling application: ./build/deepstream-app
==18652== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 35.20% 1.00027s 1901 526.18us 896ns 2.0119ms [CUDA memset]
3.32% 94.439ms 1050 89.941us 19.807us 1.2096ms void gemv2N_kernel<int, int, float2, float2, float2, int=128, int=8, int=4, int=4, int=1, cublasGemvParams<cublasGemvTensorStridedBatched, cublasGemvTensorStridedBatched, float2>>(float2 const )
2.35% 66.761ms 3625 18.416us 543ns 383.70us [CUDA memcpy HtoD]
2.05% 58.184ms 3776 15.408us 1.7600us 167.16us generatedNativePointwise
1.42% 40.331ms 80 504.14us 77.342us 1.2812ms void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=1, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
1.02% 28.982ms 160 181.14us 50.143us 366.68us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=5, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
1.01% 28.708ms 156 184.02us 3.2310us 1.9541ms void genericReformat::copyPackedKernel<float, float, bool=1, bool=1, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const , genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const , void, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
0.95% 27.043ms 160 169.02us 45.151us 383.38us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=7, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.92% 26.174ms 160 163.59us 53.503us 319.64us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=7, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.90% 25.490ms 470 54.233us 19.743us 371.96us void cuPointwise::launchPointwise<cuPointwise::StripMineAlgo<float, float, int=32>>(cuPointwise::LaunchParams, nvinfer1::VirtualMachineProgram)
0.89% 25.235ms 160 157.72us 48.255us 325.62us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=10, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.87% 24.806ms 160 155.04us 43.967us 299.23us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.79% 22.437ms 130 172.59us 60.287us 377.50us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=10, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.78% 22.074ms 103 214.31us 33.855us 909.32us trt_volta_scudnn_128x128_relu_small_nn_v1
0.76% 21.608ms 101 213.94us 33.919us 911.95us trt_volta_scudnn_128x128_relu_medium_nn_v1
0.71% 20.229ms 80 252.87us 62.846us 532.50us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=2, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.71% 20.201ms 80 252.51us 65.439us 511.86us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=1, int=7, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.70% 19.817ms 102 194.28us 24.128us 685.90us trt_volta_scudnn_128x32_relu_small_nn_v1
0.67% 19.061ms 80 238.26us 72.990us 576.59us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=4, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.67% 18.973ms 80 237.16us 64.926us 467.35us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=8, int=2, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.66% 18.756ms 130 144.28us 43.871us 302.46us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=3, int=7, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.64% 18.245ms 130 140.35us 41.631us 333.18us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=8, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.63% 17.910ms 103 173.88us 22.687us 589.20us trt_volta_scudnn_128x32_relu_medium_nn_v1
0.60% 17.100ms 102 167.64us 24.192us 621.07us trt_volta_scudnn_128x64_relu_medium_nn_v1
0.60% 16.924ms 1050 16.118us 10.623us 53.567us void fft2d_r2c_32x32<float, bool=0, unsigned int=0, bool=0>(float2, float const , int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.59% 16.872ms 472 35.745us 11.071us 238.30us void cuPointwise::launchPointwise<cuPointwise::StripMineAlgo<float, float, int=64>>(cuPointwise::LaunchParams, nvinfer1::VirtualMachineProgram)
0.59% 16.833ms 30 561.11us 20.543us 1.8067ms void fft2d_r2c_32x32<float, bool=0, unsigned int=1, bool=1>(float2, float const *, int, int, int, int, int, int, int, int, int, cudnn::reduced_divisor, bool, int2, int, int)
0.59% 16.773ms 103 162.85us 23.999us 600.08us trt_volta_scudnn_128x64_relu_small_nn_v1
0.59% 16.713ms 65 257.13us 76.926us 693.74us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=5, int=5, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.59% 16.680ms 80 208.50us 63.230us 427.93us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=1, int=8, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.59% 16.630ms 37 449.46us 195.00us 1.1155ms trt_volta_scudnn_128x128_relu_xregs_large_nn_v1
0.57% 16.213ms 472 34.350us 5.0240us 281.72us void cuPointwise::launchPointwise<cuPointwise::SimpleAlgo<float, float, int=512>>(cuPointwise::LaunchParams, nvinfer1::VirtualMachineProgram)
0.54% 15.380ms 80 192.25us 52.287us 390.58us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=5, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.54% 15.227ms 80 190.34us 54.879us 367.03us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=2, int=7, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.53% 15.159ms 80 189.49us 56.990us 374.23us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=2, int=7, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.53% 14.998ms 127 118.09us 17.280us 675.54us volta_gcgemm_32x32_nt
0.52% 14.889ms 80 186.12us 50.751us 364.70us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=4, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.52% 14.870ms 80 185.88us 62.111us 375.67us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=10, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.52% 14.839ms 472 31.438us 4.8000us 250.49us void cuPointwise::launchPointwise<cuPointwise::SimpleAlgo<float, float, int=256>>(cuPointwise::LaunchParams, nvinfer1::VirtualMachineProgram)
0.52% 14.818ms 472 31.393us 6.7830us 235.45us void cuPointwise::launchPointwise<cuPointwise::StripMineAlgo<float, float, int=128>>(cuPointwise::LaunchParams, nvinfer1::VirtualMachineProgram)
0.52% 14.780ms 80 184.75us 57.726us 371.99us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=8, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.52% 14.765ms 472 31.281us 4.6080us 245.88us void cuPointwise::launchPointwise<cuPointwise::SimpleAlgo<float, float, int=128>>(cuPointwise::LaunchParams, nvinfer1::VirtualMachineProgram)
0.52% 14.705ms 80 183.81us 51.871us 361.88us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=2, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.50% 14.127ms 67 210.84us 39.103us 705.07us void implicit_convolve_sgemm<float, float, int=128, int=5, int=5, int=3, int=3, int=3, int=1, bool=0, bool=1, bool=1>(int, int, int, float const , int, float, float const *, kernel_conv_params, __int64, int, float, float, int, float const *, float const *, bool, int, int)
0.49% 13.898ms 80 173.73us 54.111us 364.31us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=5, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.49% 13.876ms 80 173.45us 56.639us 367.90us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=9, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.49% 13.861ms 80 173.26us 59.103us 351.29us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=2, int=8, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.49% 13.787ms 80 172.33us 48.127us 364.47us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=10, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.48% 13.502ms 80 168.77us 45.182us 331.90us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=4, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.46% 12.942ms 80 161.77us 48.159us 308.50us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=8, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.45% 12.786ms 80 159.83us 44.095us 310.04us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=4, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=4, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=4, int=1Type>)
0.44% 12.624ms 80 157.80us 46.687us 325.05us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=3, int=10, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.44% 12.537ms 80 156.71us 48.927us 298.87us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=8, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.44% 12.536ms 65 192.87us 59.327us 361.11us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=1, int=7, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.44% 12.415ms 65 191.00us 63.518us 409.81us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=5, int=5, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.44% 12.395ms 65 190.70us 70.783us 390.01us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=6, int=5, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.44% 12.367ms 40 309.17us 167.77us 466.07us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=7, int=5, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.42% 11.992ms 65 184.50us 61.247us 393.11us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=8, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.41% 11.732ms 65 180.49us 43.039us 408.09us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=10, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.41% 11.718ms 40 292.96us 154.56us 437.65us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=8, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.41% 11.667ms 30 388.91us 146.20us 602.83us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=4, int=1, int=3, int=3, int=2, int=2>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.41% 11.637ms 65 179.03us 46.815us 332.73us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=8, int=2, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.41% 11.619ms 65 178.75us 60.191us 372.76us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=9, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.40% 11.479ms 300 38.264us 16.640us 254.71us void nvinfer1::tiled_pooling::poolCHW_PQT<int=5, int=5, int=1, int=1, int=1, int=1, int=192, int=1, int=1, bool=0, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.40% 11.381ms 39 291.83us 121.18us 578.39us trt_volta_scudnn_128x64_relu_xregs_large_nn_v1
0.40% 11.343ms 163 69.589us 7.9040us 980.42us void CUTENSOR_NAMESPACE::vectorized_tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t, float, float, float, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=0, int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=64 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=64 const *, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )
0.40% 11.277ms 65 173.50us 59.903us 361.37us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=4, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.40% 11.252ms 65 173.11us 62.078us 369.24us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=3, int=8, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.38% 10.935ms 95 115.11us 40.575us 303.96us void cudnn::cnn::im2col4d_kernel<float, long>(cudnn::cnn::im2col4d_params, cudnnConvolutionStruct, cudnnTensor4dStruct, float const , cudnnTensor4dStruct)
0.38% 10.725ms 65 165.00us 40.799us 292.83us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=7, int=2, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.37% 10.634ms 65 163.59us 49.342us 300.31us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=7, int=2, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.37% 10.624ms 65 163.45us 56.703us 354.26us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=7, int=1, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.37% 10.503ms 65 161.58us 57.726us 329.02us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=1, int=10, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.36% 10.132ms 151 67.098us 5.6960us 944.84us void genericReformat::copyPackedKernel<float, float, bool=0, bool=1, genericReformat::IdentityCoordMapper<int=4>, int=4>(unsigned int, unsigned int, void const , genericReformat::ArrayN<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::IdentityCoordMapper<int=4>>, genericReformat::ArrayN, int, int, int, float const , void, genericReformat::ArrayN, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayN, int, int, int, float const , int=4)
0.35% 10.026ms 65 154.25us 44.350us 395.99us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=4, int=5, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.34% 9.7748ms 65 150.38us 48.831us 332.54us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=6, int=5, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.34% 9.6878ms 65 149.04us 52.990us 298.65us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=9, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.34% 9.5538ms 65 146.98us 42.879us 283.00us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=4, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.33% 9.3041ms 65 143.14us 48.671us 325.91us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=3, int=10, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.32% 9.1827ms 65 141.27us 49.631us 249.98us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=7, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.32% 9.1622ms 65 140.96us 47.711us 251.35us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=4, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.32% 9.0482ms 65 139.20us 46.239us 324.44us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=3, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.31% 8.9225ms 65 137.27us 40.767us 295.16us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=9, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.31% 8.9092ms 537 16.590us 9.5680us 28.959us void fft2d_c2r_32x32<float, bool=1, bool=0, unsigned int=0, bool=0, bool=0>(float, float2 const , int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float, float, int2, int, int)
0.31% 8.8288ms 65 135.83us 47.870us 237.43us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=8, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.31% 8.8132ms 65 135.59us 48.127us 290.91us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=10, int=2, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.31% 8.8107ms 65 135.55us 38.815us 256.92us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=7, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.31% 8.7910ms 65 135.25us 43.359us 241.05us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=9, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.31% 8.7359ms 400 21.839us 2.8480us 124.19us void op_generic_tensor_kernel<int=3, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, int=0>(cudnnTensorStruct, float, cudnnTensorStruct, float const , cudnnTensorStruct, float const , float, float, float, float, reducedDivisorArray, int)
0.31% 8.7067ms 65 133.95us 46.622us 258.46us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=4, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.30% 8.5957ms 65 132.24us 38.367us 235.16us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.30% 8.4504ms 510 16.569us 10.336us 22.048us void fft2d_c2r_32x32<float, bool=0, bool=0, unsigned int=0, bool=0, bool=0>(float, float2 const , int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float, float, int2, int, int)
0.29% 8.3737ms 65 128.82us 41.983us 258.49us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=2, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.29% 8.2903ms 65 127.54us 34.271us 253.59us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=7, int=4, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.29% 8.1103ms 65 124.77us 39.039us 225.92us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=2, int=8, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.28% 8.0960ms 65 124.55us 33.631us 246.52us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=7, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.28% 7.8767ms 20 393.84us 244.09us 570.93us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=4, int=5, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.27% 7.7408ms 172 45.004us 5.3440us 229.37us void CUTENSOR_NAMESPACE::vectorized_tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t, float, float, float, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=1, int=32 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=256 const **, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )
0.27% 7.5909ms 65 116.78us 34.560us 227.00us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=7, int=4, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.26% 7.4060ms 65 113.94us 34.016us 215.64us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.25% 7.2003ms 55 130.91us 17.056us 713.36us void fft1d_r2c_32<float, float, float2, bool=1, bool=0>(float2, float const *, int, int3, int3, int2, int2)
0.24% 6.7590ms 23 293.87us 54.591us 668.91us void explicit_convolve_sgemm<float, int, int=128, int=5, int=5, int=3, int=3, int=3, int=0, bool=0>(int, int, int, float const *, int, float const , int, float, kernel_conv_params, __int64, int, __int64, int, float, float, int, float const *, float const *)
0.23% 6.6365ms 40 165.91us 116.96us 225.69us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=8, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.22% 6.2365ms 20 311.82us 214.01us 542.23us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=4, int=5, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.22% 6.1545ms 20 307.73us 167.16us 458.01us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=5, int=7, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.21% 6.0904ms 20 304.52us 175.64us 390.97us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=5, int=5, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.21% 6.0847ms 20 304.23us 192.70us 565.30us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=5, int=7, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.21% 6.0639ms 64 94.748us 24.031us 268.31us trt_volta_scudnn_128x32_relu_interior_nn_v1
0.21% 6.0164ms 64 94.005us 33.280us 244.99us trt_volta_scudnn_128x128_relu_interior_nn_v1
0.21% 5.9970ms 20 299.85us 164.73us 525.33us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=2, int=7, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.20% 5.8090ms 20 290.45us 174.30us 530.77us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=4>, fused::KpqkPtrWriter<float, int=1, int=1, int=4>, float, float, int=2, int=5, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.19% 5.5174ms 20 275.87us 151.55us 404.09us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=7, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.19% 5.4652ms 20 273.26us 148.77us 395.03us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=7, int=4, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.19% 5.4584ms 20 272.92us 157.92us 359.26us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=7, int=3, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.19% 5.2711ms 33 159.73us 32.223us 683.76us void explicit_convolve_sgemm<float, int, int=1024, int=5, int=5, int=3, int=3, int=3, int=0, bool=0>(int, int, int, float const *, int, float const , int, float, kernel_conv_params, __int64, int, __int64, int, float, float, int, float const *, float const *)
0.18% 5.1333ms 64 80.208us 22.399us 251.39us trt_volta_scudnn_128x64_relu_interior_nn_v1
0.18% 5.0283ms 20 251.41us 145.47us 371.61us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=5, int=7, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.17% 4.9720ms 20 248.60us 157.88us 454.61us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=5, int=7, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.17% 4.8825ms 9 542.50us 428.73us 729.90us void implicit_convolve_sgemm<float, float, int=512, int=6, int=8, int=3, int=3, int=5, int=1, bool=0, bool=1, bool=1>(int, int, int, float const , int, float, float const *, kernel_conv_params, __int64, int, float, float, int, float const *, float const , bool, int, int)
0.17% 4.8074ms 127 37.853us 19.103us 78.462us void fft1d_r2c_32<float, float, float2, bool=0, bool=0>(float2, float const *, int, int3, int3, int2, int2)
0.17% 4.7043ms 20 235.22us 136.99us 341.02us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=8, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.17% 4.6989ms 20 234.95us 165.92us 312.47us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=5, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.16% 4.6732ms 9 519.25us 433.33us 689.10us void explicit_convolve_sgemm<float, int, int=512, int=6, int=8, int=3, int=3, int=5, int=0, bool=0>(int, int, int, float const *, int, float const , int, float, kernel_conv_params, __int64, int, __int64, int, float, float, int, float const *, float const *)
0.16% 4.6149ms 20 230.75us 131.74us 318.94us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=6, int=5, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.15% 4.2583ms 20 212.91us 144.06us 270.62us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=4, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.15% 4.2083ms 20 210.42us 157.95us 296.79us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=8, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.14% 4.0549ms 20 202.75us 159.74us 274.23us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=7, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.14% 4.0011ms 20 200.06us 129.76us 270.30us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=6, int=8, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.14% 3.9895ms 20 199.47us 132.48us 337.24us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=3, int=7, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.14% 3.9887ms 20 199.44us 131.61us 276.41us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=7, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.14% 3.9364ms 555 7.0920us 1.1520us 60.574us [CUDA memcpy DtoD]
0.14% 3.9292ms 20 196.46us 142.01us 242.59us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=7, int=1, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.14% 3.9236ms 20 196.18us 124.93us 284.67us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=5, int=7, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.13% 3.8079ms 20 190.40us 132.76us 240.99us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=2, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=6, int=8, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=2, int=1Type>)
0.13% 3.7236ms 20 186.18us 129.57us 247.48us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=4, int=7, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.13% 3.7056ms 20 185.28us 119.07us 279.26us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=8, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.13% 3.6973ms 20 184.86us 123.17us 313.05us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=2>, fused::KpqkPtrWriter<float, int=1, int=1, int=2>, float, float, int=4, int=7, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.13% 3.6244ms 20 181.22us 130.43us 221.66us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=8, int=5, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.13% 3.6027ms 20 180.13us 128.89us 227.64us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=8, int=5, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.13% 3.6004ms 9 400.04us 319.64us 548.60us volta_scudnn_128x128_relu_small_nn_v1
0.13% 3.5880ms 20 179.40us 135.10us 216.86us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=6, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.13% 3.5679ms 20 178.40us 128.03us 224.54us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=7, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.12% 3.5166ms 20 175.83us 120.57us 216.86us void implicit_convolve_sgemm<float, float, int=128, int=6, int=7, int=3, int=3, int=5, int=1, bool=0, bool=1, bool=1>(int, int, int, float const , int, float, float const *, kernel_conv_params, __int64, int, float, float, int, float const *, float const *, bool, int, int)
0.12% 3.4193ms 330 10.361us 672ns 467.22us [CUDA memcpy DtoH]
0.12% 3.3998ms 20 169.99us 122.24us 204.35us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=8, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.12% 3.3637ms 20 168.18us 117.95us 226.97us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=8, int=7, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.12% 3.3309ms 20 166.54us 116.41us 199.96us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=8, int=5, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.12% 3.2976ms 20 164.88us 124.77us 200.32us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=7, int=7, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.11% 3.2623ms 19 171.70us 90.174us 381.78us volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
0.11% 3.2162ms 19 169.27us 88.606us 383.86us trt_volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
0.11% 3.2058ms 27 118.73us 53.791us 168.76us void explicit_convolve_sgemm<float, int, int=128, int=6, int=7, int=3, int=3, int=5, int=0, bool=0>(int, int, int, float const *, int, float const , int, float, kernel_conv_params, __int64, int, __int64, int, float, float, int, float const , float const )
0.11% 3.1561ms 20 157.81us 116.06us 188.99us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1, int=1>, float, float, int=5, int=7, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
0.08% 2.1980ms 127 17.307us 5.5360us 58.623us void fft1d_c2r_32<float2, float, float, bool=0, bool=1, bool=0, bool=0>(float, float2 const , int, int3, int3, int2, int, float, float, float, float)
0.07% 1.9622ms 84 23.359us 17.311us 32.543us void nvinfer1::tiled_pooling::poolCHW_PQT<int=5, int=5, int=1, int=1, int=1, int=1, int=1024, int=1, int=1, bool=0, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams, int)
0.07% 1.8902ms 7 270.03us 219.45us 337.24us volta_scudnn_128x32_sliced1x4_ldg4_relu_exp_small_nhwc_tn_v1
0.07% 1.8876ms 20 94.380us 70.494us 137.25us volta_scudnn_128x64_relu_interior_nn_v1
0.06% 1.7484ms 7 249.77us 169.79us 355.22us volta_scudnn_128x64_relu_small_nn_v1
0.06% 1.7066ms 11 155.14us 38.527us 275.80us volta_scudnn_128x32_relu_small_nn_v1
0.06% 1.6074ms 68 23.638us 2.6240us 66.015us void CUTENSOR_NAMESPACE::tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t, float, float, float, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=0, int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=64 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=64 const **, cutensorOperator_t, void const *, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )
0.05% 1.4812ms 7 211.60us 133.69us 315.70us volta_scudnn_128x64_relu_xregs_large_nn_v1
0.05% 1.4676ms 20 73.379us 30.175us 145.89us volta_scudnn_128x32_relu_medium_nn_v1
0.04% 1.1343ms 263 4.3120us 1.2800us 15.040us cask_trt::computeOffsetsKernel(cask_trt::ComputeOffsetsParams)
0.03% 993.83us 12 82.819us 47.390us 120.25us volta_scudnn_128x32_sliced1x4_ldg4_relu_exp_interior_nhwc_tn_v1
0.03% 767.69us 12 63.974us 9.3120us 151.20us void genericReformat::copyPackedKernel<float, float, bool=0, bool=1, genericReformat::ArrayN<int=5>, int=5>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=5>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=5>>, void const *, int, int, int, float const , void, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=5)
0.02% 649.36us 3 216.45us 216.00us 216.80us void explicit_convolve_sgemm<float, int, int=1024, int=6, int=7, int=3, int=3, int=5, int=0, bool=0>(int, int, int, float const *, int, float const , int, float, kernel_conv_params, __int64, int, __int64, int, float, float, int, float const *, float const *)
0.02% 634.03us 12 52.836us 19.935us 92.350us void pooling_fw_4d_kernel<float, float, cudnn::maxpooling_func<float, cudnnNanPropagation_t=0>, cudnnPoolingMode_t=0, bool=0>(cudnnTensorStruct, float const , cudnnTensorStruct, float, cudnnPoolingStruct, float, float, int, cudnn::reduced_divisor, cudnn::reduced_divisor)
0.02% 457.04us 12 38.086us 4.7350us 89.694us void genericReformat::copyPackedKernel<float, float, bool=1, bool=1, genericReformat::ArrayN<int=5>, int=5>(unsigned int, unsigned int, void const *, genericReformat::ArrayN<genericReformat::ArrayN<int=5>>, genericReformat::ArrayNWithReducedDivisors<genericReformat::ArrayN<int=5>>, void const *, int, int, int, float const , void, void const *, genericReformat::ArrayNWithReducedDivisors, genericReformat::ArrayNWithReducedDivisors, void const *, int, int, int, float const , int=5)
0.02% 453.37us 97 4.6730us 2.1760us 15.104us cask_cudnn::computeOffsetsKernel(cask_cudnn::ComputeOffsetsParams)
0.01% 371.22us 12 30.935us 3.4560us 76.958us void CUTENSOR_NAMESPACE::tensor_elementwise_kernel<CUTENSOR_NAMESPACE::pw_config_t, float, float, float, float, bool=1, cutensorOperator_t=1, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t, cutensorOperator_t>(CUTENSOR_NAMESPACE::pw_params_t, int, int, unsigned int=1, int=32 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=256 const *, CUTENSOR_NAMESPACE::pw_params_t, unsigned int=1 const *, unsigned int=256 const **, cutensorOperator_t, void const *, cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const , cutensorOperator_t, void const )
0.01% 286.01us 22 13.000us 4.0320us 37.311us void nchwToNhwcKernel<float, float, float, bool=1, bool=0, cudnnKernelDataType_t=0>(int, int, int, int, float const , float, float, float)
0.01% 219.84us 15 14.655us 2.3360us 90.110us void cask_trt::generateWinogradTilesKernel<int=0, cask_trt::Element, cask_trt::Element, cask_trt::Element>(cask_trt::GenerateWinogradTilesParams)
0.01% 186.17us 19 9.7980us 3.6800us 28.511us void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
0.01% 169.76us 4 42.438us 41.823us 43.902us void precomputed_convolve_sgemm<float, int=1024, int=5, int=5, int=4, int=3, int=3, int=1, bool=0>(int, int, int, float const , int, float, float const *, kernel_conv_params, __int64, int, float, float, int, float const *, float const , int)
0.00% 134.17us 4 33.543us 32.959us 34.399us void implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=0, bool=1, bool=1>(int, int, int, float const , int, float, float const , kernel_conv_params, __int64, int, float, float, int, float const , float const , bool, int, int)
0.00% 111.71us 19 5.8790us 4.0640us 8.2870us void nhwcToNchwKernel<float, float, float, bool=1, bool=0, cudnnKernelDataType_t=0>(int, int, int, int, float const , float, float, float)
0.00% 101.95us 4 25.487us 25.344us 25.567us volta_scudnn_128x32_relu_interior_nn_v1
0.00% 58.142us 3 19.380us 19.200us 19.583us void fft2d_c2r_32x32<float, bool=0, bool=0, unsigned int=1, bool=0, bool=0>(float, float2 const , int, int, int, int, int, int, int, int, int, float, float, cudnn::reduced_divisor, bool, float, float, int2, int, int)
0.00% 6.1760us 4 1.5440us 1.4080us 1.9200us void cudnn::cnn::kern_precompute_indices<bool=0>(int, int, int, int, int, int, int)
API calls: 42.94% 5.19685s 15148 343.07us 1.7000us 14.697ms cudaEventSynchronize
19.34% 2.34071s 64 36.574ms 33.932ms 45.032ms cuLinkAddData
9.68% 1.17156s 4445 263.57us 266ns 323.03ms cudaFree
8.22% 995.41ms 3127 318.33us 219.19us 3.4118ms cuModuleLoadData
6.94% 839.51ms 4119 203.81us 1.3820us 11.946ms cuModuleUnload
5.25% 634.97ms 15833 40.103us 2.9630us 494.13ms cudaLaunchKernel
2.41% 291.76ms 27 10.806ms 797ns 291.56ms cudaStreamCreateWithFlags
1.30% 157.17ms 4510 34.849us 2.8560us 1.0057ms cudaMalloc
1.09% 131.93ms 3 43.976ms 1.8500us 131.92ms cudaStreamCreate
0.66% 80.450ms 15148 5.3100us 1.7300us 280.09us cudaStreamAddCallback
0.63% 75.873ms 31149 2.4350us 349ns 265.53us cudaEventRecord
0.61% 73.947ms 4464 16.565us 2.2680us 594.94us cudaMemcpyAsync
0.31% 37.362ms 15148 2.4660us 915ns 51.912us cudaEventElapsedTime
0.18% 22.062ms 3776 5.8420us 3.9820us 260.13us cuLaunchKernel
0.13% 15.233ms 1894 8.0420us 3.9240us 270.12us cudaMemsetAsync
0.07% 8.2038ms 3589 2.2850us 752ns 173.71us cudaStreamSynchronize
0.04% 5.4464ms 30167 180ns 84ns 24.472us cudaGetLastError
0.03% 3.8169ms 1232 3.0980us 1.4240us 28.122us cudaCreateTextureObject
0.03% 3.4549ms 2 1.7275ms 1.4583ms 1.9966ms cudaMallocHost
0.03% 3.4277ms 64 53.557us 47.437us 208.13us cuLinkComplete
0.02% 2.1417ms 1219 1.7560us 1.1310us 23.934us cudaDestroyTextureObject
0.02% 2.1373ms 64 33.395us 31.305us 37.358us cuLinkCreate
0.02% 1.9722ms 2537 777ns 348ns 19.477us cudaStreamWaitEvent
0.01% 1.7999ms 3127 575ns 339ns 224.43us cuModuleGetFunction
0.01% 956.12us 3 318.71us 16.119us 496.65us cudaHostAlloc
0.01% 933.78us 6 155.63us 141.82us 200.05us cuDeviceTotalMem
0.01% 677.25us 10 67.725us 59.173us 98.308us cudaGetDeviceProperties
0.00% 581.08us 226 2.5710us 194ns 88.156us cudaDeviceGetAttribute
0.00% 530.88us 567 936ns 97ns 44.770us cuDeviceGetAttribute
0.00% 438.07us 40 10.951us 6.7810us 26.221us cudaMemcpy2DAsync
0.00% 393.13us 1014 387ns 253ns 1.7390us cudaFuncSetAttribute
0.00% 260.15us 2 130.07us 8.3540us 251.79us cudaFreeHost
0.00% 174.44us 85 2.0520us 1.0780us 22.255us cudaEventCreate
0.00% 171.43us 1232 139ns 87ns 188ns cudaCreateChannelDesc
0.00% 139.43us 296 471ns 273ns 2.1990us cudaEventCreateWithFlags
0.00% 120.05us 185 648ns 270ns 4.7390us cudaEventDestroy
0.00% 115.30us 7 16.471us 5.8420us 38.949us cudaMemset
0.00% 98.480us 15 6.5650us 854ns 74.695us cudaStreamCreateWithPriority
0.00% 97.549us 6 16.258us 12.444us 21.282us cuDeviceGetName
0.00% 72.887us 25 2.9150us 1.1990us 22.631us cudaStreamDestroy
0.00% 69.062us 18 3.8360us 122ns 16.226us cudaMemcpy
0.00% 39.609us 29 1.3650us 345ns 16.599us cudaGetDevice
0.00% 27.471us 64 429ns 337ns 1.6650us cuLinkDestroy
0.00% 24.731us 8 3.0910us 933ns 13.849us cudaSetDevice
0.00% 19.609us 16 1.2250us 711ns 2.2470us cudaDeviceSynchronize
0.00% 6.0360us 6 1.0060us 846ns 1.2400us cuInit
0.00% 5.2070us 11 473ns 79ns 3.1230us cudaGetDeviceCount
0.00% 3.0680us 3 1.0220us 595ns 1.4550us cudaHostGetDevicePointer
0.00% 2.4980us 8 312ns 155ns 636ns cuDeviceGetCount
0.00% 2.1010us 6 350ns 250ns 485ns cuDriverGetVersion
0.00% 1.8570us 3 619ns 561ns 679ns cudaDeviceGetStreamPriorityRange
0.00% 1.7200us 5 344ns 304ns 387ns cuDevicePrimaryCtxRelease
0.00% 1.5600us 1 1.5600us 1.5600us 1.5600us cuDeviceGetPCIBusId
0.00% 1.5390us 7 219ns 148ns 443ns cuDeviceGet
0.00% 1.3640us 6 227ns 165ns 298ns cuDeviceGetUuid
0.00% 817ns 6 136ns 127ns 153ns cudaRuntimeGetVersion
======== Error: Application returned non-zero code 12
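For anyone who wants to sort or filter a summary like the one above, here is a minimal Python sketch. It assumes each summary row follows nvprof's usual column order (Time(%), Time, Calls, Avg, Min, Max, Name), optionally prefixed by a section label such as `API calls:`. The helper names `to_seconds` and `parse_row` are hypothetical and not part of any NVIDIA tool.

```python
import re

# Time suffixes that appear in the summary rows, mapped to seconds.
_UNIT_TO_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
_TIME = re.compile(r"^([0-9.]+)(ns|us|ms|s)$")
_SECTION = re.compile(r"^\s*(API calls|GPU activities):\s*")


def to_seconds(text):
    """Convert a value such as '343.07us' or '5.19685s' to seconds."""
    match = _TIME.match(text)
    if match is None:
        raise ValueError(f"not a time value: {text!r}")
    return float(match.group(1)) * _UNIT_TO_SECONDS[match.group(2)]


def parse_row(line):
    """Parse one summary row; return None for lines that are not rows."""
    line = _SECTION.sub("", line.strip())
    parts = line.split(None, 6)  # the name may itself contain spaces
    if len(parts) < 7 or not parts[0].endswith("%"):
        return None
    try:
        return {
            "percent": float(parts[0].rstrip("%")),
            "total_s": to_seconds(parts[1]),
            "calls": int(parts[2]),
            "avg_s": to_seconds(parts[3]),
            "min_s": to_seconds(parts[4]),
            "max_s": to_seconds(parts[5]),
            "name": parts[6],
        }
    except ValueError:
        return None


if __name__ == "__main__":
    sample = ("API calls: 42.94% 5.19685s 15148 343.07us 1.7000us "
              "14.697ms cudaEventSynchronize")
    print(parse_row(sample))
```

Rows whose kernel signature was wrapped onto a second line do not start with a percentage, so `parse_row` returns None for the continuation line and it can simply be skipped.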

Hi Salini,

You appear to be using nvprof, which is deprecated. Please use Nsight Systems instead.

Nsight Systems is available for multiple targets and multiple host OSs. To choose the right package, first consider the target system to be analyzed.

See Installation Guide :: Nsight Systems Documentation for more information.

nsys profile ./build/deepstream-app

**** collection configuration ****
force-overwrite = false
stop-on-exit = true
export_sqlite = false
stats = false
capture-range = none
stop-on-range-end = false
Beta: ftrace events:
ftrace-keep-user-config = false
trace-GPU-context-switch = false
delay = 0 seconds
duration = 0 seconds
kill = signal number 15
inherit-environment = true
show-output = true
trace-fork-before-exec = false
sample_cpu = true
backtrace_method = LBR
wait = all
trace_cublas = false
trace_cuda = true
trace_cudnn = false
trace_nvtx = true
trace_mpi = false
trace_openacc = false
trace_vulkan = false
trace_opengl = true
trace_osrt = true
osrt-threshold = 0 nanoseconds
cudabacktrace = false
cudabacktrace-threshold = 0 nanoseconds
profile_processes = tree
application command = ./build/deepstream-app
application arguments =
application working directory = /home/salini/OneDrive/deepstream_va_yolox/deepstream_va/deepstream-app
NVTX profiler range trigger =
NVTX profiler domain trigger =
environment variables:
Collecting data…
Segmentation fault (core dumped)

nsys is giving me a segmentation fault, so I am not able to profile at all.

• Hardware Platform (Jetson / GPU): GPU RTX 2060
• DeepStream Version: 5.0
• TensorRT Version: 7.1
• NVIDIA GPU Driver Version (valid for GPU only): 450

@liuyis can you take a look at this?

Hi @salini.radhika, which nsys version are you using? The output suggests it is an old one. Could you try the latest version (2021.4) from Download Center | NVIDIA Developer?
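To report the installed version, `nsys --version` prints it. If it is easier to capture from a script, here is a minimal Python sketch, assuming `nsys` is on your PATH. The helper name `nsys_version` is hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def nsys_version(executable: str = "nsys") -> Optional[str]:
    """Return the text printed by `<executable> --version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        # The CLI is not installed or not on PATH.
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the version banner is printed there.
    output = (result.stdout or result.stderr).strip()
    return output or None


if __name__ == "__main__":
    print(nsys_version() or "nsys not found on PATH")
```

Posting the printed string in the thread makes it easy to tell whether the segmentation fault comes from an outdated install.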