Decrease in throughput at lower batch sizes for sparse model

Hello
We are using a Jetson AGX Orin 64 GB with JetPack 5.0.2 and TensorRT 8.4.1. We are trying to reproduce the results of this blog post (Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT | NVIDIA Technical Blog). We get the expected results for batch sizes 4 to 256, but for batch sizes 1 and 2 the throughput of the sparse model is lower than that of the dense model. We have not been able to understand this behavior.
Can anyone help in understanding this behaviour?
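
A dense-versus-sparse comparison at a single batch size can be sketched with trtexec roughly as follows. This is only an illustration under stated assumptions: the model file name, input tensor name, and 3x224x224 input shape are placeholders, and the ONNX model is assumed to have a dynamic batch dimension and 2:4 structured-sparse weights.

```shell
# Placeholders (assumptions, not taken from the original setup).
TRTEXEC=/usr/src/tensorrt/bin/trtexec   # default trtexec location on JetPack
MODEL=model_sparse.onnx                 # hypothetical ONNX file with 2:4 sparse weights
INPUT=input                             # hypothetical input tensor name
BATCH=1                                 # batch size under test

if [ -x "$TRTEXEC" ] && [ -f "$MODEL" ]; then
    # Build and time with dense kernels only.
    "$TRTEXEC" --onnx="$MODEL" --fp16 --sparsity=disable \
        --shapes="$INPUT:${BATCH}x3x224x224"
    # Build and time again, letting the builder consider sparse tactics.
    "$TRTEXEC" --onnx="$MODEL" --fp16 --sparsity=enable \
        --shapes="$INPUT:${BATCH}x3x224x224"
else
    echo "trtexec or $MODEL not found; nothing to run"
fi
```

Comparing the throughput line that trtexec reports for the two runs, and repeating with BATCH set to 2, 4, and so on, gives the per-batch-size comparison described above.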

Thanks for your question.
I’m moving your topic to the Jetson Orin board.

Hi,

Just want to confirm first.
Have you maximized the device’s performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
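
To confirm that the power mode actually took effect, the active mode can be queried afterwards. A minimal sketch, assuming the standard JetPack tools are installed (the exact output format may differ between releases):

```shell
# Query the active power mode; with the command above, mode 0 should be reported.
if command -v nvpmodel >/dev/null 2>&1; then
    POWER_MODE_INFO=$(sudo nvpmodel -q)
else
    # Fallback so the snippet does not fail on a non-Jetson machine.
    POWER_MODE_INFO="nvpmodel not found (not a Jetson device?)"
fi
echo "$POWER_MODE_INFO"
```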

Thanks.

Yes, we selected the maximum power mode, i.e. MAXN. I have attached a screenshot for reference.
We also ran all the batch sizes in the same power mode. Throughput decreases only for batch sizes 1 and 2.

Hi,

Have you also run the jetson_clocks script?
The script will fix the processor’s clock to the maximum.

Without running the script, the default is the dynamic clock mode.
In dynamic mode, performance might vary according to the workload.
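
One way to check whether the clocks are currently locked is to print the settings with the script's --show option. A minimal sketch, assuming jetson_clocks is on the PATH; when the clocks are fixed, the reported minimum and maximum frequencies would be expected to match:

```shell
# Print the current CPU, GPU, and memory clock settings.
if command -v jetson_clocks >/dev/null 2>&1; then
    CLOCK_INFO=$(sudo jetson_clocks --show)
else
    # Fallback so the snippet does not fail on a non-Jetson machine.
    CLOCK_INFO="jetson_clocks not found (not a Jetson device?)"
fi
echo "$CLOCK_INFO"
```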

Thanks.