Hi all,
I am trying to use NVIDIA Nsight Compute 2021.1.1 to profile kernels by launching trtexec with YOLOv3-Tiny, using a host (with CUDA Toolkit 9.1 and Nsight Compute 2021.1.1 installed) and a Jetson AGX Xavier as the target, but I am getting errors like ==ERROR== Failed to prepare kernel for profiling. This is my Nsight Compute UI configuration:
Host
OS: Ubuntu 18.04
CUDA compiler: 11.3
CUDA Toolkit: 9.1
NVIDIA Nsight Compute: 2021.1.1
Target: Jetson AGX Xavier
JetPack version: 4.5.1
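In case it helps narrow this down, I could also try profiling directly on the target with the ncu command line instead of going through the remote UI. A sketch of what that invocation might look like (the ncu install path on the target and the need for sudo are assumptions on my part, not something I have verified):

```shell
# Hypothetical direct invocation on the Jetson target (ncu path is an
# assumption) -- profile trtexec from the CLI to see whether the same
# "Failed to prepare kernel for profiling" errors appear without the UI.
sudo /opt/nvidia/nsight-compute/ncu \
    --target-processes all \
    -o trtexec_report \
    /usr/src/tensorrt/bin/trtexec \
        --onnx=yolov3_onnx/yolov3-tiny-416.onnx \
        --best --workspace=2048 \
        --saveEngine=yolov3-tiny-416-bs1.trt \
        --calib=calib_yolov3-tiny-int8-416.bin \
        --dumpProfile
```

Running with sudo (or after granting profiling permissions) is usually needed for GPU performance counter access on Jetson.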
Logs
Checking file deployment: libcuda-injection.so
Checking file deployment: libInterceptorInjectionTarget.so
Checking file deployment: libnvperf_host.so
Checking file deployment: libnvperf_target.so
Checking file deployment: libnvperfapi64.so
Checking file deployment: libTreeLauncherPlaceholder.so
Checking file deployment: libTreeLauncherTargetInjection.so
Checking file deployment: libTreeLauncherTargetUpdatePreloadInjection.so
Checking file deployment: ncu
Checking file deployment: TreeLauncherSubreaper
Checking file deployment: TreeLauncherTargetLdPreloadHelper
Checking file deployment: ComputeWorkloadAnalysis.section
Checking file deployment: CPIStall.py
Checking file deployment: HighPipeUtilization.py
Checking file deployment: InstructionStatistics.section
Checking file deployment: IssueSlotUtilization.py
Checking file deployment: LaunchStatistics.py
Checking file deployment: LaunchStatistics.section
Checking file deployment: MemoryL2Compression.py
Checking file deployment: MemoryWorkloadAnalysis.section
Checking file deployment: MemoryWorkloadAnalysis_Chart.section
Checking file deployment: MemoryWorkloadAnalysis_Deprecated.section
Checking file deployment: MemoryWorkloadAnalysis_Tables.section
Checking file deployment: Nvlink.section
Checking file deployment: Nvlink_Tables.section
Checking file deployment: Nvlink_Topology.section
Checking file deployment: NvRules.py
Checking file deployment: Occupancy.py
Checking file deployment: Occupancy.section
Checking file deployment: SchedulerStatistics.section
Checking file deployment: SlowPipeLimiter.py
Checking file deployment: SourceCounters.section
Checking file deployment: SpeedOfLight.py
Checking file deployment: SpeedOfLight.section
Checking file deployment: SpeedOfLight_Roofline.py
Checking file deployment: SpeedOfLight_RooflineChart.section
Checking file deployment: ThreadDivergence.py
Checking file deployment: UncoalescedAccess.py
Checking file deployment: UncoalescedSharedAccess.py
Checking file deployment: WarpStateStatistics.section
Launching: /tmp/var/target/linux-desktop-t210-a64/ncu (host: 192.168.55.1)
Process launched
==PROF== Attempting to connect to ncu-ui at 192.168.41.61:50152...
==PROF== Connected to ncu-ui at 192.168.41.61:50152.
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=yolov3_onnx/yolov3-tiny-416.onnx --best --workspace=2048 --saveEngine=yolov3-tiny-416-bs1.trt --calib=calib_yolov3-tiny-int8-416.bin --dumpProfile
[06/17/2021-13:56:24] [I] === Model Options ===
[06/17/2021-13:56:24] [I] Format: ONNX
[06/17/2021-13:56:24] [I] Model: yolov3_onnx/yolov3-tiny-416.onnx
[06/17/2021-13:56:24] [I] Output:
[06/17/2021-13:56:24] [I] === Build Options ===
[06/17/2021-13:56:24] [I] Max batch: 1
[06/17/2021-13:56:24] [I] Workspace: 2048 MB
[06/17/2021-13:56:24] [I] minTiming: 1
[06/17/2021-13:56:24] [I] avgTiming: 8
[06/17/2021-13:56:24] [I] Precision: FP32+FP16+INT8
[06/17/2021-13:56:24] [I] Calibration: calib_yolov3-tiny-int8-416.bin
[06/17/2021-13:56:24] [I] Safe mode: Disabled
[06/17/2021-13:56:24] [I] Save engine: yolov3-tiny-416-bs1.trt
[06/17/2021-13:56:24] [I] Load engine:
[06/17/2021-13:56:24] [I] Builder Cache: Enabled
[06/17/2021-13:56:24] [I] NVTX verbosity: 0
[06/17/2021-13:56:24] [I] Inputs format: fp32:CHW
[06/17/2021-13:56:24] [I] Outputs format: fp32:CHW
[06/17/2021-13:56:24] [I] Input build shapes: model
[06/17/2021-13:56:24] [I] Input calibration shapes: model
[06/17/2021-13:56:24] [I] === System Options ===
[06/17/2021-13:56:24] [I] Device: 0
[06/17/2021-13:56:24] [I] DLACore:
[06/17/2021-13:56:24] [I] Plugins:
[06/17/2021-13:56:24] [I] === Inference Options ===
[06/17/2021-13:56:24] [I] Batch: 1
[06/17/2021-13:56:24] [I] Input inference shapes: model
[06/17/2021-13:56:24] [I] Iterations: 10
[06/17/2021-13:56:24] [I] Duration: 3s (+ 200ms warm up)
[06/17/2021-13:56:24] [I] Sleep time: 0ms
[06/17/2021-13:56:24] [I] Streams: 1
[06/17/2021-13:56:24] [I] ExposeDMA: Disabled
[06/17/2021-13:56:24] [I] Spin-wait: Disabled
[06/17/2021-13:56:24] [I] Multithreading: Disabled
[06/17/2021-13:56:24] [I] CUDA Graph: Disabled
[06/17/2021-13:56:24] [I] Skip inference: Disabled
[06/17/2021-13:56:24] [I] Inputs:
[06/17/2021-13:56:24] [I] === Reporting Options ===
[06/17/2021-13:56:24] [I] Verbose: Disabled
[06/17/2021-13:56:24] [I] Averages: 10 inferences
[06/17/2021-13:56:24] [I] Percentile: 99
[06/17/2021-13:56:24] [I] Dump output: Disabled
[06/17/2021-13:56:24] [I] Profile: Enabled
[06/17/2021-13:56:24] [I] Export timing to JSON file:
[06/17/2021-13:56:24] [I] Export output to JSON file:
[06/17/2021-13:56:24] [I] Export profile to JSON file:
[06/17/2021-13:56:24] [I]
==PROF== Connected to process 19111 (/usr/src/tensorrt/bin/trtexec)
----------------------------------------------------------------
Input filename: yolov3_onnx/yolov3-tiny-416.onnx
ONNX IR version: 0.0.4
Opset version: 9
Producer name: NVIDIA TensorRT sample
Producer version:
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
==ERROR== Failed to prepare kernel for profiling
==ERROR== Failed to profile kernel "fusedConvolutionReluKernel" in process 19111
==ERROR== Failed to prepare kernel for profiling
==ERROR== Failed to profile kernel "fusedConvolutionReluKernel" in process 19111
[06/17/2021-13:56:51] [I] [TRT] Detected 1 inputs and 2 output network tensors.
[06/17/2021-13:56:51] [I] [TRT] Starting Calibration.
[06/17/2021-13:56:51] [I] [TRT] Calibrated batch 0 in 0.112621 seconds.
[06/17/2021-13:56:54] [I] [TRT] Post Processing Calibration data in 2.79758 seconds.
[06/17/2021-13:56:54] [I] [TRT] Calibration completed in 26.8958 seconds.
[06/17/2021-13:56:54] [I] [TRT] Writing Calibration Cache for calibrator: TRT-7103-EntropyCalibration2
Thanks