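Hi, I cannot reproduce the issue with your ONNX file and cal.bin. In my runs, 30 * bs1's GPU Compute Time is larger than bs30's GPU Compute Time at every precision, so the batched engine is not slower than 30 sequential batch-1 inferences. Logs: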
fp32_bs1:
[08/15/2024-05:04:09] [I] Throughput: 68.9179 qps
[08/15/2024-05:04:09] [I] Latency: min = 15.5417 ms, max = 16.423 ms, mean = 15.8086 ms, median = 15.7469 ms, percentile(99%) = 16.3875 ms
[08/15/2024-05:04:09] [I] End-to-End Host Latency: min = 28.3966 ms, max = 29.0903 ms, mean = 28.7187 ms, median = 28.7378 ms, percentile(99%) = 29.0209 ms
[08/15/2024-05:04:09] [I] Enqueue Time: min = 1.17188 ms, max = 1.43268 ms, mean = 1.24283 ms, median = 1.2316 ms, percentile(99%) = 1.42371 ms
[08/15/2024-05:04:09] [I] H2D Latency: min = 1.25488 ms, max = 2.12689 ms, mean = 1.35963 ms, median = 1.28418 ms, percentile(99%) = 1.89673 ms
[08/15/2024-05:04:09] [I] GPU Compute Time: min = 14.2705 ms, max = 14.6831 ms, mean = 14.4403 ms, median = 14.4507 ms, percentile(99%) = 14.6278 ms
[08/15/2024-05:04:09] [I] D2H Latency: min = 0.00732422 ms, max = 0.0109863 ms, mean = 0.00869192 ms, median = 0.00854492 ms, percentile(99%) = 0.0107422 ms
[08/15/2024-05:04:09] [I] Total Host Walltime: 3.04711 s
[08/15/2024-05:04:09] [I] Total GPU Compute Time: 3.03247 s
fp32_bs30:
[08/15/2024-05:39:28] [I] Throughput: 2.44966 qps
[08/15/2024-05:39:28] [I] Latency: min = 396.667 ms, max = 402.534 ms, mean = 399.33 ms, median = 399.164 ms, percentile(99%) = 402.534 ms
[08/15/2024-05:39:28] [I] End-to-End Host Latency: min = 738.49 ms, max = 747.976 ms, mean = 742.091 ms, median = 742.102 ms, percentile(99%) = 747.976 ms
[08/15/2024-05:39:28] [I] Enqueue Time: min = 1.05322 ms, max = 1.22046 ms, mean = 1.18385 ms, median = 1.19608 ms, percentile(99%) = 1.22046 ms
[08/15/2024-05:39:28] [I] H2D Latency: min = 27.3975 ms, max = 28.3652 ms, mean = 28.0859 ms, median = 28.1329 ms, percentile(99%) = 28.3652 ms
[08/15/2024-05:39:28] [I] GPU Compute Time: min = 368.506 ms, max = 374.238 ms, mean = 371.228 ms, median = 371.267 ms, percentile(99%) = 374.238 ms
[08/15/2024-05:39:28] [I] D2H Latency: min = 0.0136719 ms, max = 0.0184326 ms, mean = 0.016394 ms, median = 0.0162354 ms, percentile(99%) = 0.0184326 ms
[08/15/2024-05:39:28] [I] Total Host Walltime: 4.0822 s
[08/15/2024-05:39:28] [I] Total GPU Compute Time: 3.71228 s
fp16_bs1:
[08/15/2024-06:03:34] [I] Throughput: 143.516 qps
[08/15/2024-06:03:34] [I] Latency: min = 7.94495 ms, max = 8.11273 ms, mean = 8.02031 ms, median = 8.02319 ms, percentile(99%) = 8.10742 ms
[08/15/2024-06:03:34] [I] End-to-End Host Latency: min = 13.614 ms, max = 13.9358 ms, mean = 13.788 ms, median = 13.8126 ms, percentile(99%) = 13.9253 ms
[08/15/2024-06:03:34] [I] Enqueue Time: min = 1.01782 ms, max = 1.28809 ms, mean = 1.03474 ms, median = 1.02502 ms, percentile(99%) = 1.21899 ms
[08/15/2024-06:03:34] [I] H2D Latency: min = 1.02673 ms, max = 1.10205 ms, mean = 1.06123 ms, median = 1.06006 ms, percentile(99%) = 1.09723 ms
[08/15/2024-06:03:34] [I] GPU Compute Time: min = 6.86386 ms, max = 7.02368 ms, mean = 6.95004 ms, median = 6.96167 ms, percentile(99%) = 7.01746 ms
[08/15/2024-06:03:34] [I] D2H Latency: min = 0.00708008 ms, max = 0.0106201 ms, mean = 0.00903714 ms, median = 0.0090332 ms, percentile(99%) = 0.0104675 ms
[08/15/2024-06:03:34] [I] Total Host Walltime: 3.02406 s
[08/15/2024-06:03:34] [I] Total GPU Compute Time: 3.01632 s
fp16_bs30:
[08/15/2024-06:48:33] [I] Throughput: 5.4144 qps
[08/15/2024-06:48:33] [I] Latency: min = 203.022 ms, max = 205.748 ms, mean = 204.117 ms, median = 204.081 ms, percentile(99%) = 205.748 ms
[08/15/2024-06:48:33] [I] End-to-End Host Latency: min = 349.855 ms, max = 353.964 ms, mean = 351.773 ms, median = 351.872 ms, percentile(99%) = 353.964 ms
[08/15/2024-06:48:33] [I] Enqueue Time: min = 1.13431 ms, max = 1.25342 ms, mean = 1.22334 ms, median = 1.224 ms, percentile(99%) = 1.25342 ms
[08/15/2024-06:48:33] [I] H2D Latency: min = 27.4159 ms, max = 28.335 ms, mean = 28.1219 ms, median = 28.1774 ms, percentile(99%) = 28.335 ms
[08/15/2024-06:48:33] [I] GPU Compute Time: min = 174.941 ms, max = 177.789 ms, mean = 175.977 ms, median = 175.845 ms, percentile(99%) = 177.789 ms
[08/15/2024-06:48:33] [I] D2H Latency: min = 0.0136719 ms, max = 0.0204163 ms, mean = 0.0183945 ms, median = 0.0184937 ms, percentile(99%) = 0.0204163 ms
[08/15/2024-06:48:33] [I] Total Host Walltime: 3.69385 s
[08/15/2024-06:48:33] [I] Total GPU Compute Time: 3.51954 s
int8_bs1:
[08/15/2024-08:16:28] [I] Throughput: 182.769 qps
[08/15/2024-08:16:28] [I] Latency: min = 6.36328 ms, max = 7.59448 ms, mean = 6.40866 ms, median = 6.38892 ms, percentile(99%) = 6.59973 ms
[08/15/2024-08:16:28] [I] End-to-End Host Latency: min = 10.1303 ms, max = 11.1544 ms, mean = 10.812 ms, median = 10.8091 ms, percentile(99%) = 11.0181 ms
[08/15/2024-08:16:28] [I] Enqueue Time: min = 0.553101 ms, max = 1.22742 ms, mean = 1.02463 ms, median = 1.01935 ms, percentile(99%) = 1.10059 ms
[08/15/2024-08:16:28] [I] H2D Latency: min = 0.916016 ms, max = 2.12402 ms, mean = 0.935331 ms, median = 0.918274 ms, percentile(99%) = 1.08249 ms
[08/15/2024-08:16:28] [I] GPU Compute Time: min = 5.43848 ms, max = 5.76923 ms, mean = 5.45895 ms, median = 5.45587 ms, percentile(99%) = 5.61865 ms
[08/15/2024-08:16:28] [I] D2H Latency: min = 0.00756836 ms, max = 0.0231934 ms, mean = 0.0143764 ms, median = 0.0142822 ms, percentile(99%) = 0.0178223 ms
[08/15/2024-08:16:28] [I] Total Host Walltime: 3.01473 s
[08/15/2024-08:16:28] [I] Total GPU Compute Time: 3.00788 s
int8_bs30:
[08/15/2024-07:43:25] [I] Throughput: 7.78976 qps
[08/15/2024-07:43:25] [I] Latency: min = 150.749 ms, max = 152.283 ms, mean = 151.657 ms, median = 151.681 ms, percentile(99%) = 152.283 ms
[08/15/2024-07:43:25] [I] End-to-End Host Latency: min = 245.879 ms, max = 248.016 ms, mean = 247.156 ms, median = 247.399 ms, percentile(99%) = 248.016 ms
[08/15/2024-07:43:25] [I] Enqueue Time: min = 0.884354 ms, max = 1.23975 ms, mean = 1.19224 ms, median = 1.20685 ms, percentile(99%) = 1.23975 ms
[08/15/2024-07:43:25] [I] H2D Latency: min = 27.8704 ms, max = 28.1263 ms, mean = 27.9991 ms, median = 28.0181 ms, percentile(99%) = 28.1263 ms
[08/15/2024-07:43:25] [I] GPU Compute Time: min = 122.676 ms, max = 124.241 ms, mean = 123.632 ms, median = 123.682 ms, percentile(99%) = 124.241 ms
[08/15/2024-07:43:25] [I] D2H Latency: min = 0.0141602 ms, max = 0.0311279 ms, mean = 0.0262346 ms, median = 0.0266724 ms, percentile(99%) = 0.0311279 ms
[08/15/2024-07:43:25] [I] Total Host Walltime: 3.33771 s
[08/15/2024-07:43:25] [I] Total GPU Compute Time: 3.21443 s
I ran all of these on a single A40 machine.
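
To make the comparison explicit, here is a small sketch of the arithmetic using the mean GPU Compute Time values from the logs above. The dictionary layout and the `compare` helper are only illustrative names I made up for this check; the numbers themselves are copied from the logs.

```python
# Mean GPU Compute Time in ms, copied from the logs above.
mean_gpu_ms = {
    "fp32": {"bs1": 14.4403, "bs30": 371.228},
    "fp16": {"bs1": 6.95004, "bs30": 175.977},
    "int8": {"bs1": 5.45895, "bs30": 123.632},
}

BATCH = 30


def compare(bs1_ms, bs30_ms, batch=BATCH):
    """Return (time for `batch` sequential bs1 runs, ratio of that to one bs30 run)."""
    sequential_ms = batch * bs1_ms
    return sequential_ms, sequential_ms / bs30_ms


for precision, t in mean_gpu_ms.items():
    sequential_ms, ratio = compare(t["bs1"], t["bs30"])
    print(
        f"{precision}: 30 x bs1 = {sequential_ms:.1f} ms, "
        f"bs30 = {t['bs30']:.1f} ms, ratio = {ratio:.2f}x"
    )
```

A ratio above 1 means one batch-30 inference takes less GPU time than 30 batch-1 inferences, which is the case for fp32, fp16, and int8 here.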