How to avoid the error ==ERROR== Failed to prepare kernel for profiling

Hi all,

I am trying to use NVIDIA Nsight Compute 2021.1.1 to profile kernels by launching trtexec with YOLOv3-Tiny, using a host (CUDA Toolkit 9.1 and Nsight Compute 2021 are both installed on the host) and a Jetson AGX Xavier as the target, but I am getting errors like ==ERROR== Failed to prepare kernel for profiling. This is my configuration in the Nsight Compute UI:

Host
OS : Ubuntu 18.04
CUDA compiler : 11.3
CUDA toolkit : 9.1
NVIDIA Nsight Compute : 2021.1.1

Target : Jetson AGX Xavier
JetPack version : 4.5.1
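Before matching host and target versions, it can help to confirm what the Xavier is actually running. A minimal sketch (the file and command locations are standard JetPack ones; both branches fall back to a message, so the snippet is safe to run on any machine):

```shell
# Report the JetPack release and CUDA toolkit version on the Jetson target.
if [ -f /etc/nv_tegra_release ]; then
  release=$(cat /etc/nv_tegra_release)     # standard JetPack release file
else
  release="no /etc/nv_tegra_release (not a Jetson?)"
fi
if command -v nvcc >/dev/null 2>&1; then
  cuda=$(nvcc --version | tail -n 1)       # last line carries the version
else
  cuda="nvcc not on PATH"
fi
echo "JetPack: $release"
echo "CUDA:    $cuda"
```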

Logs


Checking file deployment: libcuda-injection.so

Checking file deployment: libInterceptorInjectionTarget.so

Checking file deployment: libnvperf_host.so

Checking file deployment: libnvperf_target.so

Checking file deployment: libnvperfapi64.so

Checking file deployment: libTreeLauncherPlaceholder.so

Checking file deployment: libTreeLauncherTargetInjection.so

Checking file deployment: libTreeLauncherTargetUpdatePreloadInjection.so

Checking file deployment: ncu

Checking file deployment: TreeLauncherSubreaper

Checking file deployment: TreeLauncherTargetLdPreloadHelper

Checking file deployment: ComputeWorkloadAnalysis.section

Checking file deployment: CPIStall.py

Checking file deployment: HighPipeUtilization.py

Checking file deployment: InstructionStatistics.section

Checking file deployment: IssueSlotUtilization.py

Checking file deployment: LaunchStatistics.py

Checking file deployment: LaunchStatistics.section

Checking file deployment: MemoryL2Compression.py

Checking file deployment: MemoryWorkloadAnalysis.section

Checking file deployment: MemoryWorkloadAnalysis_Chart.section

Checking file deployment: MemoryWorkloadAnalysis_Deprecated.section

Checking file deployment: MemoryWorkloadAnalysis_Tables.section

Checking file deployment: Nvlink.section

Checking file deployment: Nvlink_Tables.section

Checking file deployment: Nvlink_Topology.section

Checking file deployment: NvRules.py

Checking file deployment: Occupancy.py

Checking file deployment: Occupancy.section

Checking file deployment: SchedulerStatistics.section

Checking file deployment: SlowPipeLimiter.py

Checking file deployment: SourceCounters.section

Checking file deployment: SpeedOfLight.py

Checking file deployment: SpeedOfLight.section

Checking file deployment: SpeedOfLight_Roofline.py

Checking file deployment: SpeedOfLight_RooflineChart.section

Checking file deployment: ThreadDivergence.py

Checking file deployment: UncoalescedAccess.py

Checking file deployment: UncoalescedSharedAccess.py

Checking file deployment: WarpStateStatistics.section

Launching: /tmp/var/target/linux-desktop-t210-a64/ncu (host: 192.168.55.1)

Process launched

==PROF== Attempting to connect to ncu-ui at 192.168.41.61:50152...

==PROF== Connected to ncu-ui at 192.168.41.61:50152.

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=yolov3_onnx/yolov3-tiny-416.onnx --best --workspace=2048 --saveEngine=yolov3-tiny-416-bs1.trt --calib=calib_yolov3-tiny-int8-416.bin --dumpProfile

[06/17/2021-13:56:24] [I] === Model Options ===

[06/17/2021-13:56:24] [I] Format: ONNX

[06/17/2021-13:56:24] [I] Model: yolov3_onnx/yolov3-tiny-416.onnx

[06/17/2021-13:56:24] [I] Output:

[06/17/2021-13:56:24] [I] === Build Options ===

[06/17/2021-13:56:24] [I] Max batch: 1

[06/17/2021-13:56:24] [I] Workspace: 2048 MB

[06/17/2021-13:56:24] [I] minTiming: 1

[06/17/2021-13:56:24] [I] avgTiming: 8

[06/17/2021-13:56:24] [I] Precision: FP32+FP16+INT8

[06/17/2021-13:56:24] [I] Calibration: calib_yolov3-tiny-int8-416.bin

[06/17/2021-13:56:24] [I] Safe mode: Disabled

[06/17/2021-13:56:24] [I] Save engine: yolov3-tiny-416-bs1.trt

[06/17/2021-13:56:24] [I] Load engine:

[06/17/2021-13:56:24] [I] Builder Cache: Enabled

[06/17/2021-13:56:24] [I] NVTX verbosity: 0

[06/17/2021-13:56:24] [I] Inputs format: fp32:CHW

[06/17/2021-13:56:24] [I] Outputs format: fp32:CHW

[06/17/2021-13:56:24] [I] Input build shapes: model

[06/17/2021-13:56:24] [I] Input calibration shapes: model

[06/17/2021-13:56:24] [I] === System Options ===

[06/17/2021-13:56:24] [I] Device: 0

[06/17/2021-13:56:24] [I] DLACore:

[06/17/2021-13:56:24] [I] Plugins:

[06/17/2021-13:56:24] [I] === Inference Options ===

[06/17/2021-13:56:24] [I] Batch: 1

[06/17/2021-13:56:24] [I] Input inference shapes: model

[06/17/2021-13:56:24] [I] Iterations: 10

[06/17/2021-13:56:24] [I] Duration: 3s (+ 200ms warm up)

[06/17/2021-13:56:24] [I] Sleep time: 0ms

[06/17/2021-13:56:24] [I] Streams: 1

[06/17/2021-13:56:24] [I] ExposeDMA: Disabled

[06/17/2021-13:56:24] [I] Spin-wait: Disabled

[06/17/2021-13:56:24] [I] Multithreading: Disabled

[06/17/2021-13:56:24] [I] CUDA Graph: Disabled

[06/17/2021-13:56:24] [I] Skip inference: Disabled

[06/17/2021-13:56:24] [I] Inputs:

[06/17/2021-13:56:24] [I] === Reporting Options ===

[06/17/2021-13:56:24] [I] Verbose: Disabled

[06/17/2021-13:56:24] [I] Averages: 10 inferences

[06/17/2021-13:56:24] [I] Percentile: 99

[06/17/2021-13:56:24] [I] Dump output: Disabled

[06/17/2021-13:56:24] [I] Profile: Enabled

[06/17/2021-13:56:24] [I] Export timing to JSON file:

[06/17/2021-13:56:24] [I] Export output to JSON file:

[06/17/2021-13:56:24] [I] Export profile to JSON file:

[06/17/2021-13:56:24] [I]

==PROF== Connected to process 19111 (/usr/src/tensorrt/bin/trtexec)

==PROF== Connected to process 19111 (/usr/src/tensorrt/bin/trtexec)

----------------------------------------------------------------

Input filename: yolov3_onnx/yolov3-tiny-416.onnx

ONNX IR version: 0.0.4

Opset version: 9

Producer name: NVIDIA TensorRT sample

Producer version:

Domain:

Model version: 0

Doc string:

----------------------------------------------------------------

==ERROR== Failed to prepare kernel for profiling

==ERROR== Failed to profile kernel "fusedConvolutionReluKernel" in process 19111

==ERROR== Failed to prepare kernel for profiling

==ERROR== Failed to profile kernel "fusedConvolutionReluKernel" in process 19111

[06/17/2021-13:56:51] [I] [TRT] Detected 1 inputs and 2 output network tensors.

[06/17/2021-13:56:51] [I] [TRT] Starting Calibration.

[06/17/2021-13:56:51] [I] [TRT] Calibrated batch 0 in 0.112621 seconds.

[06/17/2021-13:56:54] [I] [TRT] Post Processing Calibration data in 2.79758 seconds.

[06/17/2021-13:56:54] [I] [TRT] Calibration completed in 26.8958 seconds.

[06/17/2021-13:56:54] [I] [TRT] Writing Calibration Cache for calibrator: TRT-7103-EntropyCalibration2
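For what it's worth, while sorting out the remote-UI setup, the profiler can also be driven from the CLI directly on the target. The sketch below only echoes the command it would run, so it is safe anywhere; the binary name and install path are assumptions for a JetPack-bundled 2019.5 install (where the CLI was named nv-nsight-cu-cli rather than ncu), and the trtexec arguments mirror the log above:

```shell
# Sketch: profile trtexec from the CLI on the target instead of the remote UI.
NCU=/opt/nvidia/nsight-compute/2019.5.0/nv-nsight-cu-cli
CMD="$NCU --target-processes all -o yolov3_tiny_profile /usr/src/tensorrt/bin/trtexec --onnx=yolov3_onnx/yolov3-tiny-416.onnx --best"
echo "Would run: $CMD"   # drop the echo to actually profile on the Jetson
```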

Thanks

Hi,

Please use JetPack installer for the compatible host version of CUDA and Nsight Compute.
For JetPack 4.5.1, it should be CUDA 10.2 and Nsight Compute 2019.5.

Path: /opt/nvidia/nsight-compute/2019.5.0
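As a quick sanity check, the install directory name can be compared against the version this JetPack release expects. A minimal sketch (the check_version helper is hypothetical; the 4.5.1 -> 2019.5 mapping is the one stated above):

```shell
# Hypothetical helper: flag a host Nsight Compute install whose version
# does not match the one bundled with the target's JetPack (4.5.1 -> 2019.5).
expected="2019.5"
check_version() {
  case "$(basename "$1")" in
    "$expected"*) echo "OK: $1 matches JetPack 4.5.1" ;;
    *)            echo "MISMATCH: $1 (expected ${expected}.x)" ;;
  esac
}
check_version /opt/nvidia/nsight-compute/2019.5.0   # OK
check_version /opt/nvidia/nsight-compute/2021.1.1   # MISMATCH
```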

Thanks.

@AastaLLL but Nsight Compute 2019 doesn't have a Linux aarch64 option in the target section for selecting the Jetson, as shown in the pic.

How can I select the Jetson, please?

Thanks

Hi,

Please make sure you are using the Nsight Compute installed from the JetPack host package.
You should find an aarch64 choice as below:

Thanks.
