We measured performance (decode rate and prefill rate) as a function of CPU/GPU frequency in nvpmodel 0. While nvpmodel 1 and nvpmodel 2 gave stable results, nvpmodel 0 initially produced unstable measurements.
Why does this happen and how can I make it reliable?
Hi @ethda111, which LLM and inferencing API are you using, and are you doing any “warmup” runs that are not counted in the benchmarking numbers? The first run typically takes longer. Also, you can try jetson_clocks (which will disable DVFS for the active nvpmodel).
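For reference, here's a minimal sketch of the warmup pattern (assuming a Python benchmark loop; generate() is a hypothetical stand-in for whatever inference call you are timing):

```python
# Minimal sketch: discard warmup runs, then time only the remaining runs.
# generate() is a hypothetical placeholder - replace it with your actual
# inference call and have it return the number of decoded tokens.
import time

def generate(prompt):
    time.sleep(0.05)   # stub standing in for a real prefill+decode pass
    return 128         # stub token count

WARMUP_RUNS = 2   # first runs absorb one-time costs (caches, clock ramp-up under DVFS)
TIMED_RUNS = 10

prompt = "Once upon a time"

for _ in range(WARMUP_RUNS):
    generate(prompt)   # not counted in the results

rates = []
for _ in range(TIMED_RUNS):
    start = time.perf_counter()
    tokens = generate(prompt)
    rates.append(tokens / (time.perf_counter() - start))

print(f"mean decode rate: {sum(rates) / len(rates):.1f} tokens/s")
```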
The model I used was Llama-2-7b-chat-hf-q4f16_ft from meta-llama.
When I added warmup runs, I got the following results: it's better than before, but the issue doesn't seem to be completely resolved.
For jetson_clocks, can I just run sudo /usr/bin/jetson_clocks?
Hi @ethda111, yes, that should work - and you can run jetson_clocks --store first to save the defaults, so you can later run jetson_clocks --restore to bring them back to the originals without needing to reboot.
Note that jetson_clocks does not raise the maximum frequencies the way nvpmodel does; rather, it disables DVFS and locks the clocks to the maximum frequency that the active nvpmodel power profile defines.
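Putting the two commands together, a sketch of wrapping a benchmark run in --store/--restore could look like this (run_benchmark() is a hypothetical placeholder; the script needs to run as root):

```python
# Lock clocks for the duration of a benchmark, then restore the originals.
# Needs root privileges (e.g. run with sudo). run_benchmark() is a
# hypothetical placeholder for the actual measurement.
import subprocess

def run_benchmark():
    pass  # invoke your decode/prefill measurement here

subprocess.run(["jetson_clocks", "--store"], check=True)    # save current clock settings
try:
    subprocess.run(["jetson_clocks"], check=True)           # disable DVFS, lock clocks to the nvpmodel maximums
    run_benchmark()
finally:
    subprocess.run(["jetson_clocks", "--restore"], check=True)  # put the saved settings back (no reboot needed)
```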
Thanks for your reply.
It turned out I wasn't changing the frequency of all the CPU cores, only some of them.
I'm sorry - I didn't take the online CPUs into account.
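In case it helps anyone else: a minimal sketch (assuming the standard Linux cpufreq sysfs layout, run as root) of enumerating the online cores first, so a frequency cap is applied to every core that nvpmodel 0 brings online rather than a hard-coded subset:

```python
# Apply a frequency cap to every *online* CPU core via the standard Linux
# cpufreq sysfs interface. Enumerating /sys/devices/system/cpu/online avoids
# missing cores that only come online in higher nvpmodel profiles.
from pathlib import Path

def online_cpus():
    # The file holds ranges like "0-7" or "0-3,6-7"
    spec = Path("/sys/devices/system/cpu/online").read_text().strip()
    cpus = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

def set_max_freq(khz):
    for cpu in online_cpus():
        Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_max_freq").write_text(str(khz))

# Example value in kHz; use a frequency listed in scaling_available_frequencies.
set_max_freq(1510400)
```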