Performance differences between 16G and 32G AGX Xavier

Why is there so much more GPU usage on the 32GB Xavier compared to the 16GB?

We’ve developed an application that captures from 4 Sony cameras using a Leopard Imaging board. The application was developed on a 16GB Xavier running JP 4.3, where everything works fine. However, when we install the same software on a 32GB Xavier, the solution’s GPU usage goes up and the system needs much more power (47.5 to >60). The relevant tegrastats outputs are attached to this message. A representative sample of the most important parts of both logs is:

  • stats16G.log (53.5 KB)
    RAM 6660/15823MB (lfb 2168x4MB) EMC_FREQ 78%@2133 GPU 8945/8945 CPU 2577/2577 SOC 13948/13948 VDDRQ 3485/3485 SYS5V 3796/3796
  • stats32G.log (53.7 KB)
    RAM 6712/31927MB (lfb 6183x4MB) EMC_FREQ 82%@2133 GPU 9621/9621 CPU 3457/3457 SOC 14426/14426 VDDRQ 4205/4205 SYS5V 3947/3947

The attached stats have the full output.
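To make the comparison between the two sample lines explicit, here is a minimal Python sketch that pulls the `NAME cur/avg` fields out of each line and prints the per-field difference. It assumes these fields follow the usual tegrastats convention of current/average readings per power rail (in milliwatts); the helper names `parse_rails` and `rail_deltas` are made up for this illustration.

```python
import re

# Sample lines copied verbatim from the two logs quoted above.
STATS_16G = ("RAM 6660/15823MB (lfb 2168x4MB) EMC_FREQ 78%@2133 GPU 8945/8945 "
             "CPU 2577/2577 SOC 13948/13948 VDDRQ 3485/3485 SYS5V 3796/3796")
STATS_32G = ("RAM 6712/31927MB (lfb 6183x4MB) EMC_FREQ 82%@2133 GPU 9621/9621 "
             "CPU 3457/3457 SOC 14426/14426 VDDRQ 4205/4205 SYS5V 3947/3947")

# Fields of the form "NAME cur/avg" that appear in the sample lines.
RAILS = ("GPU", "CPU", "SOC", "VDDRQ", "SYS5V")


def parse_rails(line):
    """Return {rail: current_reading} for every known rail found in one line."""
    values = {}
    for rail in RAILS:
        match = re.search(rf"\b{rail} (\d+)/(\d+)", line)
        if match:
            values[rail] = int(match.group(1))  # group(2) is the running average
    return values


def rail_deltas(before, after):
    """Return {rail: after - before} for rails present in both readings."""
    return {r: after[r] - before[r] for r in RAILS if r in before and r in after}


r16 = parse_rails(STATS_16G)
r32 = parse_rails(STATS_32G)
deltas = rail_deltas(r16, r32)

for rail, delta in deltas.items():
    print(f"{rail:6s} {r16[rail]:6d} -> {r32[rail]:6d}  ({delta:+d})")
print(f"total  {sum(r16.values()):6d} -> {sum(r32.values()):6d}  ({sum(deltas.values()):+d})")
```

Running the same parser over every line of the attached logs, rather than over a single sample, would show whether the gap is stable over time.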

This structural difference in GPU of about 800-1000 points and the similarly increased power usage worries us. In addition, we run into issues with JP 4.3 and 4.4 on 32GB systems: where the 16GB system runs for long periods of time without issue, the 32GB variant crashes and subsequently reboots after running for 5-23 minutes.

So to summarize we have a couple of questions:

  • Why does the 32GB version use so much more GPU?
  • Does the increased GPU usage account for all the increased power usage, or are there more factors in play?
  • What can cause the crashes/reboots that we encounter on a 32GB Xavier, but not on a 16GB version?

Hi,
The GPU load is reported in the GR3D_FREQ field, for example:

GR3D_FREQ 61%@1377

I don’t see much difference between the attached 16G and 32G logs. It may help to use the CUDA tools:
https://developer.nvidia.com/embedded/develop/tools
NVIDIA Visual Profiler and nvprof

Please give them a try and see if there are more clues.

Hi,

If the GPU load is signified by what seems to be a frequency, what does the GPU field mean exactly?

I extracted the GR3D_FREQ field from these (admittedly short) logs for plotting, and got:


So from this graph it seems obvious that the GPU usage on the 32G version is different. Profiling the application seems pointless, since I have already told you that the exact same software is running on both machines.

I can produce longer measurements with higher report frequencies, and I can produce graphs with running means, if any of that would help.
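For reference, the extraction and running mean mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sample lines are invented (only the `GR3D_FREQ 61%@1377` form comes from this thread), so substitute the lines of a real tegrastats log.

```python
import re
from collections import deque

# Matches the load percentage in a field such as "GR3D_FREQ 61%@1377".
GR3D_RE = re.compile(r"GR3D_FREQ (\d+)%")


def gr3d_loads(lines):
    """Return the GR3D_FREQ load percentage of every line that has one."""
    loads = []
    for line in lines:
        match = GR3D_RE.search(line)
        if match:
            loads.append(int(match.group(1)))
    return loads


def running_mean(values, window):
    """Return the mean over the last `window` values at each position."""
    recent = deque(maxlen=window)
    means = []
    for value in values:
        recent.append(value)
        means.append(sum(recent) / len(recent))
    return means


# Hypothetical log lines, shortened to the fields that matter here.
sample = [
    "EMC_FREQ 78%@2133 GR3D_FREQ 61%@1377",
    "EMC_FREQ 77%@2133 GR3D_FREQ 55%@1377",
    "EMC_FREQ 79%@2133 GR3D_FREQ 70%@1377",
]

loads = gr3d_loads(sample)
print(loads)                   # [61, 55, 70]
print(running_mean(loads, 2))  # [61.0, 58.0, 62.5]
```

On a real log, `gr3d_loads(open("stats32G.log"))` would give the series to plot, and a larger `window` smooths out short spikes.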

Thanks.

Hi,
Do you see better performance when using the Xavier 32G? If the GPU load is a bit higher, there should be slightly better performance.

I would have to figure out a way to reliably measure whether performance has in fact increased. What I do see on the 32G devices is performance degrading over time, and subsequently parts of the device or software giving up. There seems to be a relation, but I can’t put my finger on it yet.

Hi,
We have a tool called Jetson Power Estimator.
Please give it a try and share the result. There are also reference samples:

/usr/src/nvidia/graphics_demos

Since we don’t have your application, if either reference sample can be run to replicate the issue, please let us know and we can try to replicate it.

We’ve identified the root cause of our issues. Here’s a statement from our CEO on it.

We have found the cause of the problem. Before we share it with you, we need to have a meeting with NVIDIA. They introduced something that kills performance and generates more power usage and, as a result, more heat. Our report in this thread describing our problems and the differences between 16GB and 32GB Xaviers should have immediately alarmed the people within NVIDIA who know what they made. They are obviously not active on this forum, which is a shame. Due to this “feature” by NVIDIA, which unexpectedly showed up unfavourably in the 32GB Xaviers, we had to undertake a massive operation with our customers, which cost a lot of money and actually led to customers not wanting to work with us anymore.

We do understand NVIDIA is not liable, but Xaviers are not free of charge and should go with better support. So as soon as we have been in contact with NVIDIA and receive a satisfactory answer we will update this thread.

Hi,
We would like to have some more information. Could you please share in which system software components you found the issue, so that we can have the teams/experts start an initial investigation? If it is fine with you, please share the information. Thanks.

@fransklaver Are you able to elaborate at all on this issue and the findings your CEO alluded to? We have noticed degraded performance on our 32GB Xaviers as well, compared with the performance we saw on our 16GB models. Any suggestions would be appreciated.

I have not confirmed that this relates to the performance issues, but I did notice the following new feature of the 32GB AGX Xavier:

Includes 32 GB of memory with support for ECC

Reference: Jetson Product Updates

And the following regarding ECC on Jetson TX2i:

Jetson TX2i supports DRAM ECC, and DRAM ECC is enabled by default. The section To disable ECC explains how to disable this feature.

ECC (Error Correcting Code) Memory
Memory bandwidth is reduced when ECC is enabled. ECC reserves 12.5% of memory for parity bits, causing a 12.5% reduction in memory bandwidth.

Reference: “Jetson Module Support” section of the NVIDIA Jetson Linux Developer Guide

This does not confirm that such an issue exists on the 32GB AGX Xavier, but I would not rule out the possibility without NVIDIA saying one way or another.

The developer guide documents steps for disabling ECC on Jetson TX2i, but does not provide the same for AGX Xavier.

I also noticed this feature of the 32Gb AGX Xavier:

Designed for safety with internal Safety Cluster Engine (SCE): a dedicated ARM® Cortex®-R5F subsystem for integrated fault detection mechanisms, lock-step subsystems, and in-field self tests

However, I see nothing to suggest that this would result in performance degradation.

Our application is not a safety-critical system, so if these features cause significant performance loss, we would prefer to disable them if possible. @DaneLLL could you offer any guidance here? If necessary, I can make a new forum thread for this. Thank you for your time!

Hi @georgem,

Are you able to elaborate at all on this issue and the findings your CEO alluded to?

I am now.

I’m not sure if ECC is involved. We discovered that disabling nvzramconfig removes the performance issues and crashes for us. Power consumption went down, and GPU utilization seemed to go down as well, although I don’t have definitive measurements on the latter.

We do push a lot of data through, so it might have been a combination of a large zram partition and ECC, or something else entirely. NVIDIA says they are going to look into this further. We are content with disabling nvzramconfig.

Hope this helps.
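For anyone who wants to check whether zram swap is active on their device before and after the change, here is a small Python sketch that lists the zram entries in `/proc/swaps`. The function name and the sample text are hypothetical. The thread does not spell out how the service was disabled; the comment in the code names the usual systemd approach as an assumption.

```python
# Assumption (not stated in the thread): nvzramconfig is a systemd service,
# so it would typically be disabled with
#     sudo systemctl disable nvzramconfig
# followed by a reboot.


def zram_swaps(swaps_text):
    """Return (device, size_kib, used_kib) for each zram entry in /proc/swaps text."""
    entries = []
    for line in swaps_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 4 and "zram" in fields[0]:
            entries.append((fields[0], int(fields[2]), int(fields[3])))
    return entries


# Hypothetical /proc/swaps content; sizes are in KiB.
SAMPLE = """Filename                                Type            Size    Used    Priority
/dev/zram0                              partition       1995456 0       5
/dev/zram1                              partition       1995456 1024    5
"""

for device, size_kib, used_kib in zram_swaps(SAMPLE):
    print(f"{device}: {size_kib / 1024:.0f} MiB total, {used_kib / 1024:.0f} MiB used")
```

On the device itself, pass `open("/proc/swaps").read()` instead of `SAMPLE`; an empty result after a reboot suggests that no zram swap is configured anymore.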

@fransklaver thank you for sharing your findings!

Disabling nvzramconfig on our 32GB devices did improve the performance of our application. The performance is still somewhat degraded, but nearly back to acceptable levels.

To follow up on my concern regarding ECC memory on 32GB devices: I misread the “Jetson Product Updates” post. NVIDIA communicated the following to us via their partner channel:

The AGX Xavier Industrial module is not available yet.

You can refer to Jetson Modules - Industrial Roadmap https://developer.nvidia.com/embedded/develop/roadmap.

Jetson Industrial modules shall support DRAM ECC, and DRAM ECC is enabled by default.

This may be of concern down the road, but does not affect devices which are currently available.