We are trying to upgrade the OS of our product running on Jetson Xavier NX, from JetPack 4.4.1 to JetPack 5.1.2.
We’ve encountered a problematic behavior - the GPU seems to run dramatically slower on the new JetPack.
I’ve tried several different kinds of operators - from simple torch CUDA operations to TensorRT engine inference using torch_tensorrt.
I’ve run each operator 10,000 times and measured average timing in ms per frame.
A table of the results is attached.
It seems that all GPU operators take much more time on JetPack 5…
JetPack 5 comes, of course, with the matching newer versions:
- Python 3.8 (in JP4: 3.6)
- Torch 2.1.0 with CUDA 11.4 (in JP4: Torch 1.8.0 with CUDA 10.2)
- TensorRT 8.5.2 (in JP4: 7.1.3)
- torch_tensorrt 1.4.0 (compiled) (in JP4: trtorch 0.3.0)
- OpenCV 4.6.0 with CUDA 11.4 (compiled) (in JP4: 4.6.0 with CUDA 10.2)
Things I’ve checked:
- Clocks seem the same (as far as I can tell from all the data that appears in jtop; see the snippet after this list).
- jetson_clocks is running.
- Power mode is the same (15W, 6 cores).
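If it helps, here is a small sketch for dumping the raw clock values from sysfs on both devices, so the clocks can be compared numerically rather than by eye in jtop. The cpufreq/devfreq attributes are standard Linux sysfs; I'm only assuming the GPU and EMC entries appear under /sys/class/devfreq, and their device names differ between L4T releases:

import glob

# Standard Linux cpufreq/devfreq sysfs attributes; the devfreq device names
# (GPU, EMC, ...) differ between L4T releases, so just compare whatever shows up.
for path in sorted(glob.glob('/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq')):
    with open(path) as f:
        print(path, f.read().strip(), 'kHz')
for path in sorted(glob.glob('/sys/class/devfreq/*/cur_freq')):
    with open(path) as f:
        print(path, f.read().strip(), 'Hz')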
Here is a sample script (results are in the last line of the table I’ve attached, about 5% slower on JetPack 5):
import torch
import time

ITERATIONS = 10000

torch_gpu_tensor = torch.ones((10000, 10000), dtype=torch.float32, device=torch.device('cuda'))
torch_gpu_tensor.sum()  # first call is sometimes slower, do not count it
torch.cuda.synchronize()  # make sure the warm-up kernel has finished before timing

start_time = time.time()
for _ in range(ITERATIONS):
    torch_gpu_tensor.sum()
torch.cuda.current_stream().synchronize()  # wait for all queued kernels before stopping the clock
total_time = time.time() - start_time
print(f"it took {total_time} seconds for {ITERATIONS} iterations, {1000 * total_time / ITERATIONS}ms average for 1 iteration")
Can you please advise us on how to proceed?
Thank you.
Hi,
Due to the security hardening in the upstream 5.10 kernel, we do expect to see a general performance drop in JetPack 5.
Thanks.
How can we disable this security hardening? We cannot work with such latency in our product; we are time-critical.
We actually wanted to benefit from the DLA and the additional features supported in a newer JetPack, and it is absurd that we get such poor performance instead.
Hi,
Unfortunately, this comes from the upstream kernel so it cannot be removed.
But maybe you can tune a custom power mode for 15W to see if it helps.
For example, lower the CPU clocks and increase the GPU clocks.
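For reference, a custom mode is usually created by copying an existing <POWER_MODEL> entry in /etc/nvpmodel.conf, adjusting its CPU/GPU MAX_FREQ caps, and then selecting the new entry by its ID; the ID 8 below is just a placeholder, and the frequency values should come from your board's existing entries:

# Check the current mode and per-engine clocks before/after tuning
sudo nvpmodel -q
sudo jetson_clocks --show

# After adding the custom <POWER_MODEL> entry to /etc/nvpmodel.conf
# (copy the 15W 6-core entry, lower its CPU MAX_FREQ values, raise GPU MAX_FREQ),
# switch to it by its ID:
sudo nvpmodel -m 8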
Thanks.
Hi,
To reproduce this issue locally, could you share the following with us?
- The sample script for “gpu post process”
- The building step (or wheel) for PyTorch on JetPack 4.4.1 and 5.1.2.
- The building step (or wheel) for torch_tensorrt on JetPack 4.4.1 and 5.1.2.
- The building step (or wheel) for OpenCV on JetPack 4.4.1 and 5.1.2.
Thanks.
Hi,
Would you mind sharing the info above so we can reproduce this issue internally?
Thanks.
Hi AastaLLL, thank you for helping.
I’m attaching a sample script of inference + GPU post-processing. This is not exactly the model we’ve trained, but a regular mobilenet_v2 that also reproduces the issue on my devices (it is simpler and does not include any of the company’s IP). I am including the ONNX of the model, the conversion for both platforms (JP4.4.1 and JP5.1.2), and the wheels for torch, torchvision, and trtorch/torch_tensorrt.
Script + wheels can be found here:
https://drive.google.com/file/d/1ozaFlIYFkznzu0tgSqJ4eHsVMNSS3vol/view?usp=sharing
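In case it helps before downloading, compiling a stock mobilenet_v2 with torch_tensorrt on JP5 looks roughly like the sketch below; the input shape, the FP16 precision, and the use of random weights are assumptions here, and the attached script remains the authoritative version:

import torch
import torchvision
import torch_tensorrt

# Stock mobilenet_v2 with random weights (good enough for timing comparisons only)
model = torchvision.models.mobilenet_v2().eval().cuda()
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions={torch.float32, torch.half},  # allow FP16 kernels
)
dummy = torch.randn(1, 3, 224, 224, device='cuda')
with torch.no_grad():
    out = trt_model(dummy)  # time this call the same way as the sum() loop above
torch.cuda.synchronize()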
The GPU post-processing seems to be “only” around 10% slower since I re-converted the models for JP5; I don’t know exactly why.
The inference seems to be around 28% slower on our Jetsons.
Regarding OpenCV, our build includes some internal code modifications, so it is a bit more complicated to share; in any case, the sample script does not rely on it.
I hope you can reproduce the latency; otherwise, it might be something wrong with the environment setup we’ve created…
For instance, we install JetPack on an NVMe device and replace the GUI with LXDE (on both JetPacks); can this be related? Are there any other dependencies of torch/TensorRT that might differ in our environment?
Thank you again.