Orin MLPerf result

I ran the MLPerf test (inference_results_v2.0/closed/NVIDIA at master · mlcommons/inference_results_v2.0 · GitHub) on Orin.
I got the following resnet50 result:

================================================
MLPerf Results Summary

SUT name : LWIS_Server
Scenario : Offline
Mode : PerformanceOnly
Samples per second: 4719.53
Result is : VALID
Min duration satisfied : Yes
Min queries satisfied : Yes
Early stopping satisfied: Yes

================================================
Additional Stats

Min latency (ns) : 63788073
Max latency (ns) : 797114066115
Mean latency (ns) : 398299385560
50.00 percentile latency (ns) : 398102674467
90.00 percentile latency (ns) : 717305385860
95.00 percentile latency (ns) : 757208441537
97.00 percentile latency (ns) : 773202000660
99.00 percentile latency (ns) : 789172108789
99.90 percentile latency (ns) : 796318428604

================================================
Test Parameters Used

samples_per_query : 3762000
target_qps : 5700
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6655344265603136530
sample_index_rng_seed : 15863379492028895792
schedule_rng_seed : 12662793979680847247
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 2048

No warnings encountered during test.

No errors encountered during test.
Finished running actual test.
Device Device:0 processed:
11952 batches of size 256
Memcpy Calls: 373504
PerSampleCudaMemcpy Calls: 0
BatchedCudaMemcpy Calls: 0
Device Device:0.DLA-0 processed:
43913 batches of size 8
Memcpy Calls: 0
PerSampleCudaMemcpy Calls: 0
BatchedCudaMemcpy Calls: 0
Device Device:0.DLA-1 processed:
43873 batches of size 8
Memcpy Calls: 0
PerSampleCudaMemcpy Calls: 0
BatchedCudaMemcpy Calls: 0
&&&& PASSED Default_Harness # ./build/bin/harness_default
[2022-05-01 18:12:46,151 main.py:304 INFO] Result: result_samples_per_second: 4719.53, Result is VALID

======================= Perf harness results: =======================

Orin_TRT-lwis_k_99_MaxP-Offline:
resnet50: result_samples_per_second: 4719.53, Result is VALID

However, I found that the resnet50 result should be 6138.84 fps. How can I get this result?
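To put the gap in numbers, here is a small Python sketch that uses only the two figures quoted above (my measured 4719.53 samples/s and the 6138.84 fps figure). The variable names are just for illustration.

```python
# Both values are taken from this post: the measured Offline throughput
# and the figure I expected to reach.
measured_sps = 4719.53
expected_sps = 6138.84

# Fraction of the expected throughput that the run actually reached.
ratio = measured_sps / expected_sps

# Shortfall relative to the expected figure, as a percentage.
shortfall_pct = (1.0 - ratio) * 100.0

print(f"measured / expected = {ratio:.4f}")
print(f"shortfall = {shortfall_pct:.1f}%")
```

So the run reaches a bit more than three quarters of the expected throughput.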

details:
pc@pc-desktop:~/mlperf/inference_results_v2.0/closed/NVIDIA$
make run_harness RUN_ARGS="--benchmarks=resnet50 --scenarios=offline --test_mode=PerformanceOnly"
[2022-05-01 17:59:18,900 main.py:770 INFO] Detected System ID: KnownSystem.Orin

[2022-05-01 17:59:19,429 main.py:249 INFO] Running harness for resnet50 benchmark in Offline scenario…

[2022-05-01 17:59:19,437 init.py:43 INFO] Running command: ./build/bin/harness_default --logfile_outdir="/home/pc/mlperf/inference_results_v2.0/closed/NVIDIA/build/logs/2022.05.01-17.59.16/Orin_TRT/resnet50/Offline" --logfile_prefix="mlperf_log_" --performance_sample_count=2048 --test_mode="PerformanceOnly" --dla_batch_size=8 --dla_copy_streams=2 --dla_inference_streams=1 --gpu_copy_streams=2 --gpu_inference_streams=1 --use_direct_host_access=true --gpu_batch_size=256 --map_path="data_maps/imagenet/val_map.txt" --tensor_path="build/preprocessed_data/imagenet/ResNet50/int8_linear" --use_graphs=false --gpu_engines="./build/engines/Orin/resnet50/Offline/resnet50-Offline-gpu-b256-int8.lwis_k_99_MaxP.plan" --mlperf_conf_path="measurements/Orin_TRT/resnet50/Offline/mlperf.conf" --user_conf_path="measurements/Orin_TRT/resnet50/Offline/user.conf" --dla_engines="./build/engines/Orin/resnet50/Offline/resnet50-Offline-dla-b8-int8.lwis_k_99_MaxP.plan" --scenario Offline --model resnet50

[2022-05-01 17:59:19,437 init.py:50 INFO] Overriding Environment

benchmark : Benchmark.ResNet50

dla_batch_size : 8

dla_copy_streams : 2

dla_core : 0

dla_inference_streams : 1

gpu_batch_size : 256

gpu_copy_streams : 2

gpu_inference_streams : 1

input_dtype : int8

input_format : linear

map_path : data_maps/imagenet/val_map.txt

offline_expected_qps : 5700

precision : int8

scenario : Scenario.Offline

system : SystemConfiguration(host_cpu_conf=CPUConfiguration(layout={CPU(name='ARMv8 Processor rev 1 (v8l)', architecture=<CPUArchitecture.aarch64: AliasedName(name='aarch64', aliases=(), patterns=())>, core_count=4, threads_per_core=1): 3}), host_mem_conf=MemoryConfiguration(host_memory_capacity=Memory(quantity=31.357616, byte_suffix=<ByteSuffix.GB: (1000, 3)>, _num_bytes=31357616000), comparison_tolerance=0.05), accelerator_conf=AcceleratorConfiguration(layout=defaultdict(<class 'int'>, {GPU(name='NVIDIA Orin Jetson-Small Developer Kit', accelerator_type=<AcceleratorType.Integrated: AliasedName(name='Integrated', aliases=(), patterns=())>, vram=None, max_power_limit=None, pci_id=None, compute_sm=87): 1})), numa_conf=None, system_id='Orin')

tensor_path : build/preprocessed_data/imagenet/ResNet50/int8_linear

use_direct_host_access : True

use_graphs : False

config_name : Orin_resnet50_Offline

config_ver : lwis_k_99_MaxP

accuracy_level : 99%

optimization_level : plugin-enabled

inference_server : lwis

system_id : Orin

use_cpu : False

use_inferentia : False

soc_gpu_freq : None

soc_dla_freq : None

soc_cpu_freq : None

soc_emc_freq : None

orin_num_cores : None

test_mode : PerformanceOnly

openvino_version : f2f281e6

gpu_num_bundles : 2

log_dir : /home/pc/mlperf/inference_results_v2.0/closed/NVIDIA/build/logs/2022.05.01-17.59.16

&&&& RUNNING Default_Harness # ./build/bin/harness_default

[I] mlperf.conf path: measurements/Orin_TRT/resnet50/Offline/mlperf.conf

[I] user.conf path: measurements/Orin_TRT/resnet50/Offline/user.conf

Creating QSL.

Finished Creating QSL.

Setting up SUT.

[I] [TRT] [MemUsageChange] Init CUDA: CPU +283, GPU +0, now: CPU 325, GPU 7884 (MiB)

[I] [TRT] Loaded engine size: 26 MiB

[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +825, now: CPU 908, GPU 8778 (MiB)

[I] [TRT] [MemUsageChange] Init cuDNN: CPU +84, GPU +134, now: CPU 992, GPU 8912 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +24, now: CPU 0, GPU 24 (MiB)

[I] Device:0: ./build/engines/Orin/resnet50/Offline/resnet50-Offline-gpu-b256-int8.lwis_k_99_MaxP.plan has been successfully loaded.

[I] [TRT] Loaded engine size: 25 MiB

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +25, GPU +0, now: CPU 25, GPU 24 (MiB)

[I] Device:0.DLA-0: ./build/engines/Orin/resnet50/Offline/resnet50-Offline-dla-b8-int8.lwis_k_99_MaxP.plan has been successfully loaded.

[I] [TRT] Loaded engine size: 25 MiB

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +26, GPU +0, now: CPU 51, GPU 24 (MiB)

[I] Device:0.DLA-1: ./build/engines/Orin/resnet50/Offline/resnet50-Offline-dla-b8-int8.lwis_k_99_MaxP.plan has been successfully loaded.

[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1018, GPU 8958 (MiB)

[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +4, now: CPU 1018, GPU 8962 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +392, now: CPU 51, GPU 416 (MiB)

[I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +7, now: CPU 1030, GPU 9415 (MiB)

[I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1030, GPU 9425 (MiB)

[I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +392, now: CPU 51, GPU 808 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 51, GPU 810 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 51, GPU 811 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 51, GPU 813 (MiB)

[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1, now: CPU 51, GPU 814 (MiB)

[I] Creating batcher thread: 0 EnableBatcherThreadPerDevice: false

Finished setting up SUT.

Starting warmup. Running for a minimum of 5 seconds.

Finished warmup. Ran for 5.63439s.

Starting running actual test.

================================================

MLPerf Results Summary

================================================

SUT name : LWIS_Server

Scenario : Offline

Mode : PerformanceOnly

Samples per second: 4719.53

Result is : VALID

Min duration satisfied : Yes

Min queries satisfied : Yes

Early stopping satisfied: Yes

================================================

Additional Stats

================================================

Min latency (ns) : 63788073

Max latency (ns) : 797114066115

Mean latency (ns) : 398299385560

50.00 percentile latency (ns) : 398102674467

90.00 percentile latency (ns) : 717305385860

95.00 percentile latency (ns) : 757208441537

97.00 percentile latency (ns) : 773202000660

99.00 percentile latency (ns) : 789172108789

99.90 percentile latency (ns) : 796318428604

================================================

Test Parameters Used

================================================

samples_per_query : 3762000

target_qps : 5700

target_latency (ns): 0

max_async_queries : 1

min_duration (ms): 600000

max_duration (ms): 0

min_query_count : 1

max_query_count : 0

qsl_rng_seed : 6655344265603136530

sample_index_rng_seed : 15863379492028895792

schedule_rng_seed : 12662793979680847247

accuracy_log_rng_seed : 0

accuracy_log_probability : 0

accuracy_log_sampling_target : 0

print_timestamps : 0

performance_issue_unique : 0

performance_issue_same : 0

performance_issue_same_index : 0

performance_sample_count : 2048

No warnings encountered during test.

No errors encountered during test.

Finished running actual test.

Device Device:0 processed:

11952 batches of size 256

Memcpy Calls: 373504

PerSampleCudaMemcpy Calls: 0

BatchedCudaMemcpy Calls: 0

Device Device:0.DLA-0 processed:

43913 batches of size 8

Memcpy Calls: 0

PerSampleCudaMemcpy Calls: 0

BatchedCudaMemcpy Calls: 0

Device Device:0.DLA-1 processed:

43873 batches of size 8

Memcpy Calls: 0

PerSampleCudaMemcpy Calls: 0

BatchedCudaMemcpy Calls: 0

&&&& PASSED Default_Harness # ./build/bin/harness_default

[2022-05-01 18:12:46,151 main.py:304 INFO] Result: result_samples_per_second: 4719.53, Result is VALID

======================= Perf harness results: =======================

Orin_TRT-lwis_k_99_MaxP-Offline:

resnet50: result_samples_per_second: 4719.53, Result is VALID

======================= Accuracy results: =======================

Orin_TRT-lwis_k_99_MaxP-Offline:

resnet50: No accuracy results in PerformanceOnly mode.
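As a sanity check on the log above, the per-device batch counts can be reconciled with samples_per_query and with the reported throughput. The following is a minimal Python sketch that uses only numbers copied from the log. Two things are assumptions inferred from the figures rather than stated in the log: the 1.1 slack factor relating target_qps and min_duration to samples_per_query, and the use of the max latency as the total run time of the single Offline query.

```python
# (batches, batch size) per device, copied from the "processed" lines of the log.
batches = {
    "Device:0 (GPU)": (11952, 256),
    "Device:0.DLA-0": (43913, 8),
    "Device:0.DLA-1": (43873, 8),
}

# Test parameters and stats copied from the log.
samples_per_query = 3762000
target_qps = 5700
min_duration_ms = 600000
max_latency_ns = 797114066115

# Samples handled by each device, and the total across devices.
per_device = {name: n * size for name, (n, size) in batches.items()}
total_samples = sum(per_device.values())

# ASSUMPTION: samples_per_query = 1.1 * target_qps * min_duration (seconds).
# Integer arithmetic (x * 11 // 10) avoids floating-point rounding.
expected_samples = target_qps * (min_duration_ms // 1000) * 11 // 10

# ASSUMPTION: the slowest sample's latency approximates the whole run time.
throughput = samples_per_query / (max_latency_ns / 1e9)

for name, count in per_device.items():
    print(f"{name}: {count} samples ({count / total_samples:.1%})")
print(f"total samples processed: {total_samples}")
print(f"expected samples_per_query: {expected_samples}")
print(f"throughput: {throughput:.2f} samples/s")
```

Under these assumptions the device totals add up to samples_per_query, and dividing by the run time gives the reported 4719.53 samples per second. The GPU handles roughly four fifths of the samples and the two DLAs share the rest, so the run is internally consistent and the shortfall is not a logging or accounting problem.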

Just a note from the official results https://mlcommons.org/en/inference-edge-20/ (see Notes column for Orin):

GPU and both DLAs are used in resnet50, ssd-mobilenet, and ssd-resnet34, in Offline scenario. DLA loadable for resnet50 and ssd-resnet34 in offline scenario generated using preview compiler. Private harness code to run loadables. git hash: f23b7273986f02d3136673e5d18558c9a9d63799

The exact results will be reproducible in the future, given future software updates. This is allowed by MLPerf since we submitted Orin in the Preview category.

It is the same for me.

I don't know how to achieve the reported number.

Hi @andy.linluo, as mentioned above, these results were submitted in the Preview category and the exact software used to reproduce them is still to be released.