Hi
I have a problem profiling RNNT from MLPerf with the 2022 versions of Nsight Systems. The latest version, 2022.4.1, fails with the following error:
$ ~/nsight-systems-2022.4.1/bin/nsys profile -t cuda -o nsys_rnnt ./build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.07.21-09.39.05/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2513 --audio_batch_size=256 --audio_buffer_num_lines=4096 --dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 --mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/user.conf" --batch_size=128 --cuda_graph=true --pipelined_execution=true --batch_sorting=true --enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true --start_from_device=false --audio_serialized_pipeline_file="build/bin/dali/dali_pipeline_gpu_fp16.pth" --scenario Offline --model rnnt --engine_dir="./build/engines/mahmood2022/rnnt/Offline"
&&&& RUNNING RNN-T_Harness # /work/./build/bin/harness_rnnt
I1029 11:53:26.010453 916 main_rnnt.cc:2903] Found 1 GPUs
[I] Starting creating QSL.
[I] Finished creating QSL.
[I] Starting creating SUT.
[I] Set to device 0
Dali pipeline creating..
Dali pipeline created
[I] Creating stream 0/1
[I] [TRT] [MemUsageChange] Init CUDA: CPU +530, GPU +0, now: CPU 965, GPU 2720 (MiB)
[I] [TRT] Loaded engine size: 81 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1239, GPU +348, now: CPU 2388, GPU 3070 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +178, GPU +56, now: CPU 2566, GPU 3126 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2593, GPU 3186 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2593, GPU 3194 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +232, now: CPU 0, GPU 232 (MiB)
[I] Created RnntEncoder runner: encoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2593, GPU 3428 (MiB)
[I] [TRT] Loaded engine size: 3 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2599, GPU 3436 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2599, GPU 3446 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 232 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2599, GPU 3450 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2599, GPU 3458 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 0, GPU 233 (MiB)
[I] Created RnntDecoder runner: decoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3458 (MiB)
[I] [TRT] Loaded engine size: 1 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_a
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3458 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2601, GPU 3466 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2601, GPU 3466 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_b
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3466 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2601, GPU 3474 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2601, GPU 3474 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointBackend runner: joint_backend
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2601, GPU 3474 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3482 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2601, GPU 3492 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3484 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2601, GPU 3492 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntIsel runner: isel
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2601, GPU 3492 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 236 (MiB)
[I] Created RnntIgather runner: igather
[I] Instantiated RnntEngineContainer runner
cudaMemcpy blocking
cudaMemcpy blocking
[I] Instantiated RnntTensorContainer host memory
Stream::Stream sampleSize: 61440
Stream::Stream singleSampleSize: 480
Stream::Stream fullseqSampleSize: 61440
Stream::Stream mBatchSize: 128
[E] [TRT] 3: [executionContext.cpp::setBindingDimensions::943] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::943, condition: profileMaxDims.d[i] >= dimensions.d[i]. Supplied binding dimension [32,1] for bindings[0] exceed min ~ max range at index 0, maximum dimension in profile is 16, minimum dimension in profile is 1, but supplied dimension is 32.
)
F1029 11:53:28.426635 916 main_rnnt.cc:780] Check failed: context->setBindingDimensions(bindingIdx, inputDims) == true (0 vs. 1)
*** Check failure stack trace: ***
@ 0x7efe740b6f00 google::LogMessage::Fail()
@ 0x7efe740b6e3b google::LogMessage::SendToLog()
@ 0x7efe740b676c google::LogMessage::Flush()
@ 0x7efe740b9d7a google::LogMessageFatal::~LogMessageFatal()
@ 0x558e99094860 EngineRunner::enqueue()
@ 0x558e990792ad doBatchDecoderIteration()
@ 0x558e99079bb4 makeDecoderGraph()
@ 0x558e9907d064 Stream::Stream()
@ 0x558e9907e531 RNNTServer::RNNTServer()
@ 0x558e99074057 main
@ 0x7efe1fe52083 __libc_start_main
@ 0x558e99074c9e _start
@ (nil) (unknown)
Generating '/tmp/nsys-report-b73d.qdstrm'
[1/1] [========================100%] nsys_rnnt.nsys-rep
Generated:
/work/nsys_rnnt.nsys-rep
I should add that I have no such problem with the other benchmarks, and when I run the same RNNT command directly on the device (without nsys), the program finishes without any error.
Any thoughts on what might be going on?