(mlperf) mahmood@mlperf-inference-mahmood-x86_64:/work$ /usr/local/NVIDIA-Nsight-Compute-2022.2/nv-nsight-cu-cli --kill on -c 78825 --metrics smsp__inst_executed.sum,gpc__cycles_elapsed.avg -f -o resnet50-new ./build/bin/harness_default --logfile_outdir="/work/build/logs/2023.02.10-15.16.30/mahmood2022_TRT/resnet50/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2048 --gpu_copy_streams=2 --gpu_inference_streams=1 --run_infer_on_copy_streams=false --gpu_batch_size=1024 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="build/preprocessed_data/imagenet/ResNet50/int8_linear" --use_graphs=false --gpu_engines="./build/engines/mahmood2022/resnet50/Offline/resnet50-Offline-gpu-b1024-int8.lwis_k_99_MaxP.plan" --mlperf_conf_path="measurements/mahmood2022_TRT/resnet50/Offline/mlperf.conf" --user_conf_path="measurements/mahmood2022_TRT/resnet50/Offline/user.conf" --max_dlas=0 --scenario Offline --model resnet50
&&&& RUNNING Default_Harness # /work/./build/bin/harness_default
[I] mlperf.conf path: measurements/mahmood2022_TRT/resnet50/Offline/mlperf.conf
[I] user.conf path: measurements/mahmood2022_TRT/resnet50/Offline/user.conf
Creating QSL.
==PROF== Connected to process 617 (/work/build/bin/harness_default)
Finished Creating QSL.
Setting up SUT.
[I] [TRT] [MemUsageChange] Init CUDA: CPU +537, GPU +0, now: CPU 560, GPU 939 (MiB)
[I] [TRT] Loaded engine size: 26 MiB
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1246, GPU +342, now: CPU 1857, GPU 1315 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +180, GPU +66, now: CPU 2037, GPU 1381 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +24, now: CPU 0, GPU 24 (MiB)
[I] Device:0: ./build/engines/mahmood2022/resnet50/Offline/resnet50-Offline-gpu-b1024-int8.lwis_k_99_MaxP.plan has been successfully loaded.
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 2011, GPU 1373 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2011, GPU 1381 (MiB)
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1568, now: CPU 0, GPU 1592 (MiB)
[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2038, GPU 3111 (MiB)
[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2038, GPU 3121 (MiB)
[I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1568, now: CPU 0, GPU 3160 (MiB)
[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false
Finished setting up SUT.
Starting warmup. Running for a minimum of 5 seconds.
==PROF== Profiling "convActPoolKernelV2" - 0 (1/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "res2_sm_80_kernel" - 1 (2/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 2 (3/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 3 (4/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "conv_kernel_128_sb" - 4 (5/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 5 (6/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 6 (7/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 7 (8/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 8 (9/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 9 (10/78825): 0%....50%....100% - 1 pass
==PROF== Profiling "sm80_xmma_fprop_implicit_gemm..." - 10 (11/78825): 0%....50%....100% - 1 pass