Orin mlperf result

olderHunter · May 1, 2022, 10:46am

I ran the MLPerf test (inference_results_v2.0/closed/NVIDIA at master · mlcommons/inference_results_v2.0 · GitHub) on Orin.
I got the following resnet50 result:

================================================
MLPerf Results Summary

SUT name : LWIS_Server
Scenario : Offline
Mode : PerformanceOnly
Samples per second: 4719.53
Result is : VALID
Min duration satisfied : Yes
Min queries satisfied : Yes
Early stopping satisfied: Yes

================================================
Additional Stats

Min latency (ns) : 63788073
Max latency (ns) : 797114066115
Mean latency (ns) : 398299385560
50.00 percentile latency (ns) : 398102674467
90.00 percentile latency (ns) : 717305385860
95.00 percentile latency (ns) : 757208441537
97.00 percentile latency (ns) : 773202000660
99.00 percentile latency (ns) : 789172108789
99.90 percentile latency (ns) : 796318428604

================================================
Test Parameters Used

samples_per_query : 3762000
target_qps : 5700
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 2048

No warnings encountered during test.

No errors encountered during test.
Finished running actual test.
Device Device:0 processed:
11952 batches of size 256
Memcpy Calls: 373504
PerSampleCudaMemcpy Calls: 0
BatchedCudaMemcpy Calls: 0
Device Device:0.DLA-0 processed:
43913 batches of size 8
Memcpy Calls: 0
PerSampleCudaMemcpy Calls: 0
BatchedCudaMemcpy Calls: 0
Device Device:0.DLA-1 processed:
43873 batches of size 8
Memcpy Calls: 0
PerSampleCudaMemcpy Calls: 0
BatchedCudaMemcpy Calls: 0
&&&& PASSED Default_Harness # ./build/bin/harness_default
[2022-05-01 18:12:46,151 main.py:304 INFO] Result: result_samples_per_second: 4719.53, Result is VALID

======================= Perf harness results: =======================

Orin_TRT-lwis_k_99_MaxP-Offline:
resnet50: result_samples_per_second: 4719.53, Result is VALID

But I found that the resnet50 result should be 6138.84 fps. How can I get this result?

github.com

mlcommons/inference_results_v2.0/blob/master/closed/NVIDIA/results/Orin_TRT/resnet50/Offline/performance/run_1/mlperf_log_summary.txt

================================================
MLPerf Results Summary
================================================
SUT name : LWIS_Server
Scenario : Offline
Mode     : PerformanceOnly
Samples per second: 6138.84
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes

================================================
Additional Stats
================================================
Min latency (ns)                : 68588034
Max latency (ns)                : 655824286123
Mean latency (ns)               : 327955003598
50.00 percentile latency (ns)   : 327961769745
90.00 percentile latency (ns)   : 590265725104

This file has been truncated.
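To put a number on the gap between the two summaries, here is a minimal Python sketch that pulls the "Samples per second" line out of each summary and compares them. The helper name `samples_per_second` and the inlined summary snippets are illustrative assumptions, not part of the MLPerf harness; in practice the text would be read from each `mlperf_log_summary.txt`.

```python
import re


def samples_per_second(summary_text: str) -> float:
    """Extract the 'Samples per second' value from MLPerf summary text.

    Hypothetical helper for illustration; not part of the MLPerf harness.
    """
    match = re.search(r"Samples per second\s*:\s*([0-9.]+)", summary_text)
    if match is None:
        raise ValueError("no 'Samples per second' line found")
    return float(match.group(1))


# Snippets copied from the two summaries quoted in this post.
local_summary = """
Scenario : Offline
Samples per second: 4719.53
Result is : VALID
"""
published_summary = """
Scenario : Offline
Samples per second: 6138.84
Result is : VALID
"""

local = samples_per_second(local_summary)
published = samples_per_second(published_summary)
ratio = local / published

print(f"local run:     {local:.2f} samples/s")
print(f"published run: {published:.2f} samples/s")
print(f"local / published = {ratio:.3f} (shortfall {(1 - ratio) * 100:.1f}%)")
```

With these two values the local run reaches roughly 77% of the published throughput.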

olderHunter · May 1, 2022, 10:50am

details:
pc@pc-desktop:~/mlperf/inference_results_v2.0/closed/NVIDIA$
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=offline --test_mode=PerformanceOnly"
[2022-05-01 17:59:18,900 main.py:770 INFO] Detected System ID: KnownSystem.Orin

[2022-05-01 17:59:19,429 main.py:249 INFO] Running harness for resnet50 benchmark in Offline scenario…

[2022-05-01 17:59:19,437 __init__.py:43 INFO] Running command: ./build/bin/harness_default --logfile_outdir="/home/pc/mlperf/inference_results_v2.0/closed/NVIDIA/build/logs/2022.05.01-17.59.16/Orin_TRT/resnet50/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2048 --test_mode="PerformanceOnly" --dla_batch_size=8 --dla_copy_streams=2 --dla_inference_streams=1 --gpu_copy_streams=2 --gpu_inference_streams=1 --use_direct_host_access=true --gpu_batch_size=256 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="build/preprocessed_data/imagenet/ResNet50/int8_linear" --use_graphs=false --gpu_engines="./build/engines/Orin/resnet50/Offline/resnet50-Offline-gpu-b256-int8.lwis_k_99_MaxP.plan" --mlperf_conf_path="measurements/Orin_TRT/resnet50/Offline/mlperf.conf" --user_conf_path="measurements/Orin_TRT/resnet50/Offline/user.conf" --dla_engines="./build/engines/Orin/resnet50/Offline/resnet50-Offline-dla-b8-int8.lwis_k_99_MaxP.plan" --scenario Offline --model resnet50

[2022-05-01 17:59:19,437 __init__.py:50 INFO] Overriding Environment

benchmark : Benchmark.ResNet50

dla_batch_size : 8

dla_copy_streams : 2

dla_core : 0

dla_inference_streams : 1

gpu_batch_size : 256

gpu_copy_streams : 2

gpu_inference_streams : 1

input_dtype : int8

input_format : linear

map_path : data_maps/imagenet/val_map.txt

offline_expected_qps : 5700

precision : int8

scenario : Scenario.Offline

system : SystemConfiguration(host_cpu_conf=CPUConfiguration(layout={CPU(name='ARMv8 Processor rev 1 (v8l)', architecture=<CPUArchitecture.aarch64: AliasedName(name='aarch64', aliases=(), patterns=())>, core_count=4, threads_per_core=1): 3}), host_mem_conf=MemoryConfiguration(host_memory_capacity=Memory(quantity=31.357616, byte_suffix=<ByteSuffix.GB: (1000, 3)>, _num_bytes=31357616000), comparison_tolerance=0.05), accelerator_conf=AcceleratorConfiguration(layout=defaultdict(<class 'int'>, {GPU(name='NVIDIA Orin Jetson-Small Developer Kit', accelerator_type=<AcceleratorType.Integrated: AliasedName(name='Integrated', aliases=(), patterns=())>, vram=None, max_power_limit=None, pci_id=None, compute_sm=87): 1})), numa_conf=None, system_id='Orin')

tensor_path : build/preprocessed_data/imagenet/ResNet50/int8_linear

use_direct_host_access : True

use_graphs : False

config_name : Orin_resnet50_Offline

config_ver : lwis_k_99_MaxP

accuracy_level : 99%

optimization_level : plugin-enabled

inference_server : lwis

system_id : Orin

use_cpu : False

use_inferentia : False

soc_gpu_freq : None

soc_dla_freq : None

soc_cpu_freq : None

soc_emc_freq : None

orin_num_cores : None

test_mode : PerformanceOnly

openvino_version : f2f281e6

gpu_num_bundles : 2

log_dir : /home/pc/mlperf/inference_results_v2.0/closed/NVIDIA/build/logs/2022.05.01-17.59.16

&&&& RUNNING Default_Harness # ./build/bin/harness_default

[I] mlperf.conf path: measurements/Orin_TRT/resnet50/Offline/mlperf.conf

[I] user.conf path: measurements/Orin_TRT/resnet50/Offline/user.conf

Creating QSL.

Finished Creating QSL.

Setting up SUT.

[I] [TRT] [MemUsageChange] Init CUDA: CPU +283, GPU +0, now: CPU 325, GPU 7884 (MiB)

[I] [TRT] Loaded engine size: 26 MiB

[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +825, now: CPU 908, GPU 8778 (MiB)

[I] [TRT] [MemUsageChange] Init cuDNN: CPU +84, GPU +134, now: CPU 992, GPU 8912 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +24, now: CPU 0, GPU 24 (MiB)

[I] Device:0: ./build/engines/Orin/resnet50/Offline/resnet50-Offline-gpu-b256-int8.lwis_k_99_MaxP.plan has been successfully loaded.

[I] [TRT] Loaded engine size: 25 MiB

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +25, GPU +0, now: CPU 25, GPU 24 (MiB)

[I] Device:0.DLA-0: ./build/engines/Orin/resnet50/Offline/resnet50-Offline-dla-b8-int8.lwis_k_99_MaxP.plan has been successfully loaded.

[I] [TRT] Loaded engine size: 25 MiB

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +26, GPU +0, now: CPU 51, GPU 24 (MiB)

[I] Device:0.DLA-1: ./build/engines/Orin/resnet50/Offline/resnet50-Offline-dla-b8-int8.lwis_k_99_MaxP.plan has been successfully loaded.

[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1018, GPU 8958 (MiB)

[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +4, now: CPU 1018, GPU 8962 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +392, now: CPU 51, GPU 416 (MiB)

[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +7, now: CPU 1030, GPU 9415 (MiB)

[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1030, GPU 9425 (MiB)

[I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +392, now: CPU 51, GPU 808 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 51, GPU 810 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 51, GPU 811 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 51, GPU 813 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 51, GPU 814 (MiB)

[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false

Finished setting up SUT.

Starting warmup. Running for a minimum of 5 seconds.

Finished warmup. Ran for 5.63439s.

Starting running actual test.

================================================

MLPerf Results Summary

================================================

SUT name : LWIS_Server

Scenario : Offline

Mode : PerformanceOnly

Samples per second: 4719.53

Result is : VALID

Min duration satisfied : Yes

Min queries satisfied : Yes

Early stopping satisfied: Yes

================================================

Additional Stats

================================================

Min latency (ns) : 63788073

Max latency (ns) : 797114066115

Mean latency (ns) : 398299385560

50.00 percentile latency (ns) : 398102674467

90.00 percentile latency (ns) : 717305385860

95.00 percentile latency (ns) : 757208441537

97.00 percentile latency (ns) : 773202000660

99.00 percentile latency (ns) : 789172108789

99.90 percentile latency (ns) : 796318428604

================================================

Test Parameters Used

================================================

samples_per_query : 3762000

target_qps : 5700

target_latency (ns): 0

max_async_queries : 1

min_duration (ms): 600000

max_duration (ms): 0

min_query_count : 1

max_query_count : 0

qsl_rng_seed : 6655344265603136530

sample_index_rng_seed : 15863379492028895792

schedule_rng_seed : 12662793979680847247

accuracy_log_rng_seed : 0

accuracy_log_probability : 0

accuracy_log_sampling_target : 0

print_timestamps : 0

performance_issue_unique : 0

performance_issue_same : 0

performance_issue_same_index : 0

performance_sample_count : 2048

No warnings encountered during test.

No errors encountered during test.

Finished running actual test.

Device Device:0 processed:

11952 batches of size 256

Memcpy Calls: 373504

PerSampleCudaMemcpy Calls: 0

BatchedCudaMemcpy Calls: 0

Device Device:0.DLA-0 processed:

43913 batches of size 8

Memcpy Calls: 0

PerSampleCudaMemcpy Calls: 0

BatchedCudaMemcpy Calls: 0

Device Device:0.DLA-1 processed:

43873 batches of size 8

Memcpy Calls: 0

PerSampleCudaMemcpy Calls: 0

BatchedCudaMemcpy Calls: 0

&&&& PASSED Default_Harness # ./build/bin/harness_default

[2022-05-01 18:12:46,151 main.py:304 INFO] Result: result_samples_per_second: 4719.53, Result is VALID

======================= Perf harness results: =======================

Orin_TRT-lwis_k_99_MaxP-Offline:

resnet50: result_samples_per_second: 4719.53, Result is VALID

======================= Accuracy results: =======================

Orin_TRT-lwis_k_99_MaxP-Offline:

resnet50: No accuracy results in PerformanceOnly mode.
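The per-device batch counts in this log can be turned into a breakdown of how the work was split between the GPU and the two DLAs. The sketch below is plain arithmetic on numbers copied from the output above. It assumes the Offline run issued a single query, so the maximum latency is treated as the total run time, and it reads the 1.1 factor in the `samples_per_query` check as a 10% margin over `target_qps` times `min_duration`; both are interpretations of the log, not statements from the harness documentation.

```python
# Batch counts and batch sizes copied from the harness output above.
devices = {
    "Device:0 (GPU)": (11952, 256),
    "Device:0.DLA-0": (43913, 8),
    "Device:0.DLA-1": (43873, 8),
}

# Test parameters and stats copied from the summary above.
samples_per_query = 3_762_000
target_qps = 5700
min_duration_ms = 600_000
max_latency_ns = 797_114_066_115  # assumed to equal the total run time

# Samples handled by each device = batches * batch size.
samples = {name: batches * size for name, (batches, size) in devices.items()}
total = sum(samples.values())
elapsed_s = max_latency_ns / 1e9

# Assumption: samples_per_query is target_qps * min_duration with a 10% margin.
expected = round(target_qps * (min_duration_ms / 1000) * 1.1)

for name, count in samples.items():
    share = count / total
    rate = count / elapsed_s
    print(f"{name}: {count} samples, {share:.1%} of total, {rate:.1f} samples/s")

print(f"total samples: {total} (samples_per_query = {samples_per_query})")
print(f"target_qps * min_duration * 1.1 = {expected}")
print(f"overall throughput: {total / elapsed_s:.2f} samples/s")
```

In this run the GPU handled about 81% of the samples and each DLA about 9%, and the per-device totals add up exactly to `samples_per_query`.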

dusty_nv · May 3, 2022, 7:05pm

Just a note from the official results https://mlcommons.org/en/inference-edge-20/ (see Notes column for Orin):

GPU and both DLAs are used in resnet50, ssd-mobilenet, and ssd-resnet34, in Offline scenario. DLA loadable for resnet50 and ssd-resnet34 in offline scenario generated using preview compiler. Private harness code to run loadables. git hash: f23b7273986f02d3136673e5d18558c9a9d63799

The exact results will be reproducible in the future, given future software updates. This is allowed by MLPerf since we submitted Orin in the Preview category.

andy.linluo · May 5, 2022, 6:11am

It is the same for me.

I don't know how to achieve the reported number.

dusty_nv · May 5, 2022, 4:49pm

Hi @andy.linluo, as mentioned above, these results were submitted in the Preview category and the exact software used to reproduce them is still to be released.
