Nsys fails with RNNT

Hi
I had (and still have) a problem profiling RNNT from MLPerf with the 2022 versions of Nsight Systems. The latest version, 2022.4.1, fails with the following error:

$ ~/nsight-systems-2022.4.1/bin/nsys profile -t cuda -o nsys_rnnt ./build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.07.21-09.39.05/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2513 --audio_batch_size=256 --audio_buffer_num_lines=4096 --dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 --mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/user.conf" --batch_size=128 --cuda_graph=true --pipelined_execution=true --batch_sorting=true --enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true --start_from_device=false --audio_serialized_pipeline_file="build/bin/dali/dali_pipeline_gpu_fp16.pth" --scenario Offline --model rnnt --engine_dir="./build/engines/mahmood2022/rnnt/Offline"
&&&& RUNNING RNN-T_Harness # /work/./build/bin/harness_rnnt
I1029 11:53:26.010453   916 main_rnnt.cc:2903] Found 1 GPUs
[I] Starting creating QSL.
[I] Finished creating QSL.
[I] Starting creating SUT.
[I] Set to device 0
Dali pipeline creating..
Dali pipeline created
[I] Creating stream 0/1
[I] [TRT] [MemUsageChange] Init CUDA: CPU +530, GPU +0, now: CPU 965, GPU 2720 (MiB)
[I] [TRT] Loaded engine size: 81 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1239, GPU +348, now: CPU 2388, GPU 3070 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +178, GPU +56, now: CPU 2566, GPU 3126 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2593, GPU 3186 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2593, GPU 3194 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +232, now: CPU 0, GPU 232 (MiB)
[I] Created RnntEncoder runner: encoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2593, GPU 3428 (MiB)
[I] [TRT] Loaded engine size: 3 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2599, GPU 3436 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2599, GPU 3446 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 232 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2599, GPU 3450 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2599, GPU 3458 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 0, GPU 233 (MiB)
[I] Created RnntDecoder runner: decoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3458 (MiB)
[I] [TRT] Loaded engine size: 1 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_a
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3458 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2601, GPU 3466 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2601, GPU 3466 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_b
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2600, GPU 3466 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2601, GPU 3474 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 2601, GPU 3474 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointBackend runner: joint_backend
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2601, GPU 3474 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3482 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2601, GPU 3492 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2601, GPU 3484 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2601, GPU 3492 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntIsel runner: isel
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2601, GPU 3492 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 236 (MiB)
[I] Created RnntIgather runner: igather
[I] Instantiated RnntEngineContainer runner
cudaMemcpy blocking 
cudaMemcpy blocking 
[I] Instantiated RnntTensorContainer host memory
Stream::Stream sampleSize: 61440
Stream::Stream singleSampleSize: 480
Stream::Stream fullseqSampleSize: 61440
Stream::Stream mBatchSize: 128
[E] [TRT] 3: [executionContext.cpp::setBindingDimensions::943] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::943, condition: profileMaxDims.d[i] >= dimensions.d[i]. Supplied binding dimension [32,1] for bindings[0] exceed min ~ max range at index 0, maximum dimension in profile is 16, minimum dimension in profile is 1, but supplied dimension is 32.
)
F1029 11:53:28.426635   916 main_rnnt.cc:780] Check failed: context->setBindingDimensions(bindingIdx, inputDims) == true (0 vs. 1) 
*** Check failure stack trace: ***
    @     0x7efe740b6f00  google::LogMessage::Fail()
    @     0x7efe740b6e3b  google::LogMessage::SendToLog()
    @     0x7efe740b676c  google::LogMessage::Flush()
    @     0x7efe740b9d7a  google::LogMessageFatal::~LogMessageFatal()
    @     0x558e99094860  EngineRunner::enqueue()
    @     0x558e990792ad  doBatchDecoderIteration()
    @     0x558e99079bb4  makeDecoderGraph()
    @     0x558e9907d064  Stream::Stream()
    @     0x558e9907e531  RNNTServer::RNNTServer()
    @     0x558e99074057  main
    @     0x7efe1fe52083  __libc_start_main
    @     0x558e99074c9e  _start
    @              (nil)  (unknown)
Generating '/tmp/nsys-report-b73d.qdstrm'
[1/1] [========================100%] nsys_rnnt.nsys-rep
Generated:
    /work/nsys_rnnt.nsys-rep


I should add that I don't have any problems with the other benchmarks, and when I run the same RNNT command on the device without profiling, the program finishes without any error.

Any thoughts on that?

That actually looks to me like it is failing outside of Nsys. @skottapalli, does this look like a CUPTI issue to you?

The log doesn’t point out the root cause.

  1. Please try profiling with -t none -s none --cpuctxsw=none to see if the workload runs successfully.
  2. If 1 works, then try -t cuda -s none --cpuctxsw=none to see if the workload runs successfully.
  3. If 2 works, then try with just the -t cuda CLI option.

This will help identify which feature is causing your workload to fail under nsys.

Also, what is the output of nvidia-smi on the target system?

The same error is observed with option 1:

$ ~/nsight-systems-2022.4.1/bin/nsys profile -t none -s none --cpuctxsw=none -o nsys_rnnt ./build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.07.21-09.39.05/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2513 --audio_batch_size=256 --audio_buffer_num_lines=4096 --dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 --mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/user.conf" --batch_size=128 --cuda_graph=true --pipelined_execution=true --batch_sorting=true --enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true --start_from_device=false --audio_serialized_pipeline_file="build/bin/dali/dali_pipeline_gpu_fp16.pth" --scenario Offline --model rnnt --engine_dir="./build/engines/mahmood2022/rnnt/Offline"
&&&& RUNNING RNN-T_Harness # /work/./build/bin/harness_rnnt
I1102 17:07:35.201519  2764 main_rnnt.cc:2903] Found 1 GPUs
[I] Starting creating QSL.
[I] Finished creating QSL.
[I] Starting creating SUT.
[I] Set to device 0
Dali pipeline creating..
Dali pipeline created
[I] Creating stream 0/1
[I] [TRT] [MemUsageChange] Init CUDA: CPU +349, GPU +0, now: CPU 631, GPU 2566 (MiB)
[I] [TRT] Loaded engine size: 81 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +809, GPU +346, now: CPU 1624, GPU 2916 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +126, GPU +58, now: CPU 1750, GPU 2974 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1768, GPU 3032 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1768, GPU 3040 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +232, now: CPU 0, GPU 232 (MiB)
[I] Created RnntEncoder runner: encoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1768, GPU 3278 (MiB)
[I] [TRT] Loaded engine size: 3 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1774, GPU 3286 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 1775, GPU 3296 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 232 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1775, GPU 3300 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1775, GPU 3308 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 0, GPU 233 (MiB)
[I] Created RnntDecoder runner: decoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1775, GPU 3308 (MiB)
[I] [TRT] Loaded engine size: 1 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_a
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1775, GPU 3308 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1776, GPU 3316 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1776, GPU 3316 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_b
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1776, GPU 3316 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1776, GPU 3324 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1776, GPU 3324 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointBackend runner: joint_backend
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1776, GPU 3324 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1776, GPU 3332 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1776, GPU 3342 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1776, GPU 3334 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1776, GPU 3342 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntIsel runner: isel
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1776, GPU 3342 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 236 (MiB)
[I] Created RnntIgather runner: igather
[I] Instantiated RnntEngineContainer runner
cudaMemcpy blocking 
cudaMemcpy blocking 
[I] Instantiated RnntTensorContainer host memory
Stream::Stream sampleSize: 61440
Stream::Stream singleSampleSize: 480
Stream::Stream fullseqSampleSize: 61440
Stream::Stream mBatchSize: 128
[E] [TRT] 3: [executionContext.cpp::setBindingDimensions::943] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::943, condition: profileMaxDims.d[i] >= dimensions.d[i]. Supplied binding dimension [32,1] for bindings[0] exceed min ~ max range at index 0, maximum dimension in profile is 16, minimum dimension in profile is 1, but supplied dimension is 32.
)
F1102 17:07:37.367681  2764 main_rnnt.cc:780] Check failed: context->setBindingDimensions(bindingIdx, inputDims) == true (0 vs. 1) 
*** Check failure stack trace: ***
    @     0x7f83f4715f00  google::LogMessage::Fail()
    @     0x7f83f4715e3b  google::LogMessage::SendToLog()
    @     0x7f83f471576c  google::LogMessage::Flush()
    @     0x7f83f4718d7a  google::LogMessageFatal::~LogMessageFatal()
    @     0x55ded2086860  EngineRunner::enqueue()
    @     0x55ded206b2ad  doBatchDecoderIteration()
    @     0x55ded206bbb4  makeDecoderGraph()
    @     0x55ded206f064  Stream::Stream()
    @     0x55ded2070531  RNNTServer::RNNTServer()
    @     0x55ded2066057  main
    @     0x7f83a04b1083  __libc_start_main
    @     0x55ded2066c9e  _start
    @              (nil)  (unknown)
Generating '/tmp/nsys-report-f0a9.qdstrm'
[1/1] [========================100%] nsys_rnnt.nsys-rep
Generated:
    /work/nsys_rnnt.nsys-rep

The output of nvidia-smi on the host machine (not inside the MLPerf Docker container) is shown below:

$ nvidia-smi 
Wed Nov  2 18:09:21 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0  On |                  N/A |
|  0%   42C    P8    30W / 370W |    170MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1417      G   /usr/lib/xorg/Xorg                 23MiB |
|    0   N/A  N/A      2093      G   /usr/lib/xorg/Xorg                 87MiB |
|    0   N/A  N/A      2216      G   /usr/bin/gnome-shell               29MiB |
|    0   N/A  N/A      2730      G   ...mviewer/tv_bin/TeamViewer       13MiB |
+-----------------------------------------------------------------------------+

That is very strange. Option 1 turns off all profiling under nsys. It sounds like some environment or setup is not passing through to the workload correctly when run under nsys. Is it possible for you to share a repro? Maybe a simpler repro if you cannot share your original one.

I used the same repository as uploaded at github/inference/v2.0/closed/nvidia.
I did find one thing, though. When I run rnnt via the make command, it prints "loading plugin", while in the generated command shown just before the test, that plugin option is missing. I traced that and manually added the --plugins option. It then showed some other errors about preprocessed_data_dir and other paths. I fixed all of them, so the correct command is this:

CMD="./build/bin/harness_rnnt \
--plugins=/work/build/plugins/RNNTOptPlugin/librnntoptplugin.so \
--logfile_outdir=/work/build/logs/2022.11.02-17.19.24/mahmood2022_TRT/rnnt/Offline \
--logfile_prefix=mlperf_log_ \
--performance_sample_count=2513 \
--audio_batch_size=256 \
--audio_buffer_num_lines=4096 \
--dali_batches_issue_ahead=4 \
--dali_pipeline_depth=4 \
--num_warmups=512 \
--raw_data_dir=/work/build/preprocessed_data/rnnt_dev_clean_500_raw \
--raw_length_dir=/work/build/preprocessed_data/rnnt_dev_clean_500_raw/int32 \
--preprocessed_data_dir=/work/build/preprocessed_data/rnnt_dev_clean_512/fp16 \
--preprocessed_length_dir=/work/build/preprocessed_data/rnnt_dev_clean_512/int32 \
--val_map=/work/data_maps/rnnt_dev_clean_512/val_map.txt \
--mlperf_conf_path=measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf \
--user_conf_path=/work/measurements/mahmood2022_TRT/rnnt/Offline/user.conf \
--batch_size=16 \
--cuda_graph=true \
--pipelined_execution=true \
--batch_sorting=true \
--enable_audio_processing=true \
--use_copy_kernel=true \
--streams_per_gpu=1 \
--audio_fp16_input=true \
--start_from_device=false \
--audio_serialized_pipeline_file=/work/build/bin/dali/dali_pipeline_gpu_fp16.pth \
--scenario Offline \
--model rnnt \
--engine_dir=/work/build/engines/mahmood2022/rnnt/Offline"

Q: Would you please check that you also use the same command for rnnt?

I then used that command with nsys, and as the output shows, everything is fine.

~/nsight-systems/2022.4.1/bin/nsys profile -t cuda,cudnn,nvtx  -o nsys_rnnt $CMD

Please see the output at pastebin or justpaste (if pastebin doesn't work).
After that I ran nsys stats nsys_rnnt.nsys-rep, and you can see the output at pastebin or justpaste (if pastebin doesn't work).

The weird thing in the output is that it doesn't show any kernel information. I don't know if there is a bug in the nsys profiler or in the step that reads the SQLite file and generates the report.

Processing [nsys_rnnt.sqlite] with [/opt/nvidia/nsight-systems/2022.4.1/host-linux-x64/reports/gpukernsum.py]...
SKIPPED: nsys_rnnt.sqlite does not contain CUDA kernel data.
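For reference, one way to double-check whether the exported SQLite file contains any kernel rows at all is to query it directly. Below is a minimal sketch; it assumes the CUPTI_ACTIVITY_KIND_KERNEL table name that recent nsys exports use, so adjust it if your version differs:

import sqlite3

con = sqlite3.connect("nsys_rnnt.sqlite")
cur = con.cursor()

# list every table in the export
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print("tables:", tables)

# count kernel records if the kernel activity table is present
if "CUPTI_ACTIVITY_KIND_KERNEL" in tables:
    (count,) = cur.execute("SELECT COUNT(*) FROM CUPTI_ACTIVITY_KIND_KERNEL").fetchone()
    print("kernel records:", count)
else:
    print("no kernel activity table, so no CUDA kernel data was recorded")

con.close()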

I have uploaded the nsys-rep file here. You can also download the sqlite file from here.

I haven't modified the source code. I just followed the steps to build, preprocess, and run the workloads. I also have to emphasize that the other benchmarks are fine with nsys; I see this weird behavior only with rnnt.

The github and pastebin links in your reply do NOT open for me. I either get 404 or "site can't be reached" errors. Could you provide the correct links?

Did the previous errors about binding dimensions go away? From your latest reply, it looks like the CUDA events were not collected. Are you sure the workload uses the GPU?

If it does, then there might be something else going on here. Is the workload exiting cleanly? Does the app use the Python multiprocessing module? If so, please make sure that you:

  1. Use set_start_method in the multiprocessing module to change the start method to "spawn", which is much safer and allows tools like Nsight Systems to collect data. See the code example given in the link multiprocessing — Process-based parallelism — Python 3.11.0 documentation, and the short sketch after this list.
    On Linux, the multiprocessing module defaults to the "fork" start method, which forks new processes but does not call exec. According to the POSIX standard, fork without exec leads to undefined behavior, and tools like Nsight Systems that rely on injection are only allowed to make async-signal-safe calls in such a process. This makes it very hard for tools like Nsight Systems to collect profiling information.
  2. Ensure that processes exit gracefully (for example, by using the close and join methods on the multiprocessing module's objects). Otherwise, Nsight Systems cannot flush its buffers properly and you might end up with missing CUDA traces.
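A minimal sketch of both points, assuming the workload's Python side uses the standard multiprocessing module (the worker function here is a hypothetical placeholder):

import multiprocessing as mp

def worker(idx):
    # hypothetical placeholder for the real per-process work
    print("worker", idx)

if __name__ == "__main__":
    # "spawn" starts a fresh interpreter (fork + exec) instead of a bare fork,
    # so injection-based tools like Nsight Systems can attach cleanly
    mp.set_start_method("spawn")

    procs = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        # joining lets each process exit cleanly so profiler buffers can be flushed
        p.join()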

I edited the previous post. Please see if the links work for you. If you still cannot download the nsys-rep and sqlite files from Mediafire, I will upload them to another website. Sorry for the inconvenience.

Are you sure the workload uses the GPU?

When I run rnnt, it indeed uses the GPU, according to the nvidia-smi output:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1454      G   /usr/lib/xorg/Xorg                 23MiB |
|    0   N/A  N/A      3578      G   /usr/lib/xorg/Xorg                 97MiB |
|    0   N/A  N/A     44453      C   ./build/bin/harness_rnnt         7989MiB |
+-----------------------------------------------------------------------------+

Hi @mahmood.nt, I checked with the MLPerf team at NVIDIA and they recommended using the make command to run rnnt, such as:

nsys profile -t cuda,cudnn,nvtx -o nsys_rnnt make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=offline --test_mode=PerformanceOnly --fast"

Could you please try that?

It doesn’t work either…

(mlperf) mahmood@mlperf-inference-mahmood-x86_64:/work$ /home/mahmood/nsight-systems-2022.4.1/bin/nsys profile -t cuda,cudnn,nvtx -o nsys_rnnt make run_harness RUN_ARGS="--benchmarks=rnnt --scenarios=offline --test_mode=PerformanceOnly --fast"
[2022-11-13 12:51:05,831 main.py:770 INFO] Detected System ID: KnownSystem.mahmood2022
[2022-11-13 12:51:06,045 main.py:249 INFO] Running harness for rnnt benchmark in Offline scenario...
[2022-11-13 12:51:06,049 __init__.py:43 INFO] Running command: ./build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.11.13-12.51.03/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2513 --test_mode="PerformanceOnly" --audio_batch_size=256 --audio_buffer_num_lines=4096 --dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 --mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/user.conf" --batch_size=16 --cuda_graph=true --pipelined_execution=true --batch_sorting=true --enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true --start_from_device=false --audio_serialized_pipeline_file="build/bin/dali/dali_pipeline_gpu_fp16.pth" --scenario Offline --model rnnt --engine_dir="./build/engines/mahmood2022/rnnt/Offline"
[2022-11-13 12:51:06,049 __init__.py:50 INFO] Overriding Environment
audio_batch_size : 256
audio_buffer_num_lines : 4096
benchmark : Benchmark.RNNT
dali_batches_issue_ahead : 4
dali_pipeline_depth : 4
gpu_batch_size : 16
gpu_copy_streams : 1
gpu_inference_streams : 1
input_dtype : fp16
input_format : linear
map_path : data_maps/rnnt_dev_clean_512/val_map.txt
num_warmups : 512
offline_expected_qps : 13300
precision : fp16
scenario : Scenario.Offline
system : SystemConfiguration(host_cpu_conf=CPUConfiguration(layout={CPU(name='AMD Ryzen 7 3700X 8-Core Processor', architecture=<CPUArchitecture.x86_64: AliasedName(name='x86_64', aliases=(), patterns=())>, core_count=8, threads_per_core=2): 1}), host_mem_conf=MemoryConfiguration(host_memory_capacity=Memory(quantity=131.833424, byte_suffix=<ByteSuffix.GB: (1000, 3)>, _num_bytes=131833424000), comparison_tolerance=0.05), accelerator_conf=AcceleratorConfiguration(layout=defaultdict(<class 'int'>, {GPU(name='NVIDIA GeForce RTX 3080', accelerator_type=<AcceleratorType.Discrete: AliasedName(name='Discrete', aliases=(), patterns=())>, vram=Memory(quantity=10.0, byte_suffix=<ByteSuffix.GiB: (1024, 3)>, _num_bytes=10737418240), max_power_limit=430.0, pci_id='0x220610DE', compute_sm=86): 1})), numa_conf=NUMAConfiguration(numa_nodes={}, num_numa_nodes=1), system_id='mahmood2022')
tensor_path : build/preprocessed_data/rnnt_dev_clean_512/fp16
use_graphs : True
config_name : mahmood2022_rnnt_Offline
config_ver : custom_k_99_MaxP
accuracy_level : 99%
optimization_level : plugin-enabled
inference_server : custom
system_id : mahmood2022
use_cpu : False
use_inferentia : False
power_limit : None
cpu_freq : None
test_mode : PerformanceOnly
fast : True
openvino_version : f2f281e6
gpu_num_bundles : 2
log_dir : /work/build/logs/2022.11.13-12.51.03
&&&& RUNNING RNN-T_Harness # ./build/bin/harness_rnnt
I1113 12:51:06.300426 210625 main_rnnt.cc:2903] Found 1 GPUs
[I] Starting creating QSL.
[I] Finished creating QSL.
[I] Starting creating SUT.
[I] Set to device 0
Dali pipeline creating..
Dali pipeline created
[I] Creating stream 0/1
[I] [TRT] [MemUsageChange] Init CUDA: CPU +530, GPU +0, now: CPU 971, GPU 2784 (MiB)
[I] [TRT] Loaded engine size: 81 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1239, GPU +348, now: CPU 2394, GPU 3134 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +177, GPU +56, now: CPU 2571, GPU 3190 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2598, GPU 3249 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2598, GPU 3257 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +232, now: CPU 0, GPU 232 (MiB)
[I] Created RnntEncoder runner: encoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2598, GPU 3491 (MiB)
[I] [TRT] Loaded engine size: 3 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2604, GPU 3499 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2605, GPU 3509 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 232 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2605, GPU 3513 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2605, GPU 3521 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 0, GPU 233 (MiB)
[I] Created RnntDecoder runner: decoder
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2605, GPU 3521 (MiB)
[I] [TRT] Loaded engine size: 1 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_a
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2605, GPU 3521 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2606, GPU 3529 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2606, GPU 3529 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointFc1 runner: fc1_b
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2606, GPU 3529 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2606, GPU 3537 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2606, GPU 3537 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntJointBackend runner: joint_backend
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2606, GPU 3537 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2606, GPU 3545 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2606, GPU 3555 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2606, GPU 3547 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2606, GPU 3555 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] Created RnntIsel runner: isel
[I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2606, GPU 3555 (MiB)
[I] [TRT] Loaded engine size: 0 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 234 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 236 (MiB)
[I] Created RnntIgather runner: igather
[I] Instantiated RnntEngineContainer runner
cudaMemcpy blocking 
cudaMemcpy blocking 
[I] Instantiated RnntTensorContainer host memory
Stream::Stream sampleSize: 61440
Stream::Stream singleSampleSize: 480
Stream::Stream fullseqSampleSize: 61440
Stream::Stream mBatchSize: 16
[I] Finished creating SUT.
[I] Starting warming up SUT.
[I] Finished warming up SUT.
[I] Starting running actual test.
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "code/main.py", line 303, in handle_run_harness
    result = harness.run_harness()
  File "/work/code/common/harness.py", line 264, in run_harness
    output = run_command(cmd, get_output=True, custom_env=self.env_vars)
  File "/work/code/common/__init__.py", line 64, in run_command
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command './build/bin/harness_rnnt --logfile_outdir="/work/build/logs/2022.11.13-12.51.03/mahmood2022_TRT/rnnt/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2513 --test_mode="PerformanceOnly" --audio_batch_size=256 --audio_buffer_num_lines=4096 --dali_batches_issue_ahead=4 --dali_pipeline_depth=4 --num_warmups=512 --mlperf_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/rnnt/Offline/user.conf" --batch_size=16 --cuda_graph=true --pipelined_execution=true --batch_sorting=true --enable_audio_processing=true --use_copy_kernel=true --streams_per_gpu=1 --audio_fp16_input=true --start_from_device=false --audio_serialized_pipeline_file="build/bin/dali/dali_pipeline_gpu_fp16.pth" --scenario Offline --model rnnt --engine_dir="./build/engines/mahmood2022/rnnt/Offline"' returned non-zero exit status 139.
Traceback (most recent call last):
  File "code/main.py", line 772, in <module>
    main(main_args, DETECTED_SYSTEM)
  File "code/main.py", line 744, in main
    dispatch_action(main_args, config_dict, workload_id, equiv_engine_setting=equiv_engine_setting)
  File "code/main.py", line 574, in dispatch_action
    handle_run_harness(benchmark_conf, need_gpu, need_dla, profile, power)
  File "code/main.py", line 312, in handle_run_harness
    raise RuntimeError("Run harness failed!")
RuntimeError: Run harness failed!
make: *** [Makefile:713: run_harness] Error 1
Generating '/tmp/nsys-report-4398.qdstrm'
[1/1] [========================100%] nsys_rnnt.nsys-rep
Generated:
    /work/nsys_rnnt.nsys-rep

I also ran RNNT without nsys (maybe not related to this topic, but it can help), and it finishes normally, printing the accuracy and timing numbers. Please see this shared folder on Drive to access the hardware log files and the nsys report.