Thermal performance difference between TX2 modules

We are running thermal tests with two new TX2 modules (heatsink DCV-01672-N2-G) and got considerably worse results than with the TX2 module from the EVM board. Is there any difference between them?
The metric we compare is the temperature difference between “TTP” and “GPU”: around 17C on the EVM board module versus around 30C on our modules.
We use carrier board P2597 and monitor TTP with a thermocouple while running this stress test:

# performance mode
sudo nvpmodel -m 0
sudo jetson_clocks

# cpu stress test
stress --cpu 6 --io 6 --vm 2 --vm-bytes 2048M &

while true
do
    /home/nvidia/NVIDIA_CUDA-10.0_Samples/5_Simulations/nbody_opengles/nbody_opengles -benchmark -fp64 -fullscreen -numbodies=1000000
done

Results:

Case 1:
Nvidia module: Shipped with P2597 evm board running Jetpack 4.2.
Ambient temperature around 26C.
TTP = 54C
Result Temp difference GPU - TTP = 71.5C-54C = 17.5C
Log from tegrastats:
09/18/19 15:26:34: RAM 4642/7860MB (lfb 594x4MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@64.5C MCPU@64.5C PMIC@100C Tboard@53C GPU@71.5C BCPU@64.5C thermal@67.3C Tdiode@71.5C VDD_SYS_GPU 11748/10605 VDD_SYS_SOC 1144/1291 VDD_4V0_WIFI 0/10 VDD_IN 20424/19739 VDD_SYS_CPU 4958/5077 VDD_SYS_DDR 1840/1981

Case 2:
Carrier board: P2597
Nvidia module: Our module with DCV-01672-N2-GP, thermal pad 86x66x0.5mm, K=2, running Jetpack 4.2.
Ambient temperature around 26C.
TTP = 57C
Result Temp difference GPU - TTP = 87.5C-57C = 30.5C
Log from tegrastats:
10/21/19 18:18:21: RAM 2379/7860MB (lfb 980x4MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@83C MCPU@83C PMIC@100C Tboard@56C GPU@87.5C BCPU@83C thermal@85C Tdiode@88C VDD_SYS_GPU 12887/11385 VDD_SYS_SOC 1220/1280 VDD_4V0_WIFI 0/11 VDD_IN 22466/20531 VDD_SYS_CPU 5797/5357 VDD_SYS_DDR 1875/1826
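
For reference, the GPU - TTP figure used in both cases can be reproduced from a tegrastats line with a small helper. This is only a minimal sketch: the function names `sensor_temp` and `delta_to_ttp` are hypothetical, and the TTP value is passed in by hand because it comes from the external thermocouple, not from tegrastats. Negative sensor readings are not handled.

```shell
# Extract one sensor reading (e.g. GPU, Tboard, Tdiode) from a tegrastats line.
# Usage: sensor_temp "<tegrastats line>" GPU
sensor_temp() {
    local line=$1 name=$2
    # Match " NAME@<number>C"; the leading space keeps "CPU" from matching "BCPU".
    printf ' %s\n' "$line" | sed -n "s/.* ${name}@\([0-9.]*\)C.*/\1/p"
}

# Difference between an on-die sensor and the externally measured TTP value.
# Usage: delta_to_ttp "<tegrastats line>" GPU <TTP in C>
delta_to_ttp() {
    local temp
    temp=$(sensor_temp "$1" "$2")
    # Fail if the sensor name was not found in the line.
    [ -n "$temp" ] || return 1
    awk -v t="$temp" -v ttp="$3" 'BEGIN { printf "%.1f\n", t - ttp }'
}
```

With the Case 2 line and a TTP of 57, `delta_to_ttp "$line" GPU 57` prints 30.5; with the Case 1 line and a TTP of 54 it prints 17.5.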

Thoughts?

Hi, can you provide the serial number of all modules? They should be on labels.

EVM board module:
S/N: 0421518034673
699-83310-1000-B02 C

Our two modules:
S/N: 1422519117022
699-83310-1000-D00 F

S/N: 1422519117160
699-83310-1000-D00 F

Hi, did the two modules come with heatsink + fan? And was it the same use case (workload) and the same carrier board (P2597 of CVB) on all modules?

The modules didn’t come with a heatsink; we bought the heatsinks separately (DCV-01672-N2-G).
We test on one P2597 carrier board and swap the modules, running the same software and use case.

What kind of HS_TIM are you using between the purchased module and the heatsink? Per the thermal design guide info below, the HS_TIM might be the cause of this difference.

HS_TIM - The customer is responsible for providing the thermal interface material between the TTP and the thermal solution. For best thermal performance, the TIM should provide low thermal impedance within the mechanical, reliability, and cost constraints of the customer’s product.

We use a thermal pad, 86x66x0.5mm, K=2, with specs and material similar to what we use for other products. Let me know if you need more details or can suggest an HS_TIM.

But in my opinion this is not related to the heatsink or HS_TIM. Note that I’m talking about the temperature difference between TTP and the internal sensors. To put it another way: we are also testing our purchased module without heatsink and without fans in a temperature-controlled chamber. At 73C TTP (thermocouple located at the middle of the module), the results show the module already starts clock throttling. We will try later to test up to 80C TTP.

Results for our bought bare module (no heatsink, no fans) at 73C TTP:
RAM 2546/7860MB (lfb 1030x4MB) CPU [100%@1728,100%@1728,100%@1728,100%@1728,100%@1728,100%@1728] EMC_FREQ 0% GR3D_FREQ 99% PLL@93C MCPU@93C PMIC@100C Tboard@71C GPU@95.5C BCPU@93C thermal@94C Tdiode@95.5C VDD_SYS_GPU 6163/6597 VDD_SYS_SOC 1194/1181 VDD_4V0_WIFI 0/10 VDD_IN 14098/14381 VDD_SYS_CPU 3774/3877 VDD_SYS_DDR 1754/1758

We don’t want to remove the heatsink from the EVM TX2 module for now, so we can’t measure at the suggested TTP location and can’t run exactly the same test. Instead, we placed the thermocouple at the center of the heatsink and reached 73C. There is no clock throttling in this case.

Results EVM TX2 module at 73C in center of heatsink:
RAM 3121/7860MB (lfb 868x4MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@83.5C MCPU@83.5C PMIC@100C Tboard@72C GPU@91C BCPU@83.5C thermal@86.5C Tdiode@91.5C VDD_SYS_GPU 11888/11696 VDD_SYS_SOC 1189/1128 VDD_4V0_WIFI 0/10 VDD_IN 21446/21031 VDD_SYS_CPU 5376/5339 VDD_SYS_DDR 2031/1967

Hi dbnvai,

We are investigating this internally. Can you please share below info?

-HS and fan P/N and pictures
-The contact pictures between the TTP and HS

Sorry, I didn’t notice your reply.
I couldn’t find how to share pictures of our heatsink; still trying.

But as mentioned in last message, we are also testing without heatsink and without fan.

How do we prove/test that our modules comply with this specification: “The temperature of the TTP must always be kept under this 80 °C limit in order to maintain the required performance and reliability”? If TTP is kept under 80C, should we expect NO clock throttling to engage at any point?

Yes, TTP should be under 80C. That is the target of the thermal solution, and meeting it guarantees the module’s internal temperature is in range.

Then why is clock throttling engaging at TTP temps under 80C in our tests?

This is data I shared on a previous post:
Results for our bought bare module (no heatsink, no fans) at 73C TTP:
RAM 2546/7860MB (lfb 1030x4MB) CPU [100%@1728,100%@1728,100%@1728,100%@1728,100%@1728,100%@1728] EMC_FREQ 0% GR3D_FREQ 99% PLL@93C MCPU@93C PMIC@100C Tboard@71C GPU@95.5C BCPU@93C thermal@94C Tdiode@95.5C VDD_SYS_GPU 6163/6597 VDD_SYS_SOC 1194/1181 VDD_4V0_WIFI 0/10 VDD_IN 14098/14381 VDD_SYS_CPU 3774/3877 VDD_SYS_DDR 1754/1758

The measurement location of TTP is given in Figure 3-2 of the thermal design guide. Was the sensor placed at that point? Our thermal team has done some comparison tests but did not find such a big difference. The thermal team suggests testing all boards with exactly the same settings, for example without fan/HS, and measuring the TTP temperature at the official test point.

Yes, the sensor was placed at that point and measured with a Fluke 51 thermometer after the system had been running for 2 hours.
We have tested two boards and got the same results. We have 3 other new boards that we haven’t tested yet, but we would prefer to reach some conclusion with the current two boards first.

Do you mean the test settings of the two boards are exactly the same as those of the devkit? That is, you removed the HS on the devkit and tested at the same point on the TTP? That’s a different result from ours.

Hi dbnvai,

Have you done the test with the other 3 new boards? Any conclusion?

Hi,
Of the 5 new modules, we have tested only 2. I might test one more module.

“Which means you remove the HS on devkit and test on same point on TTP? That’s different result to ours.”
No, I didn’t remove the HS from the devkit TX2 module; I don’t want to remove it.

Our intention for now is to prove “The temperature of the TTP must always be kept under this 80 °C limit in order to maintain the required performance and reliability”.

I will test 3 modules again without HS using the devkit. What’s your suggestion on how to conduct this test? What results (GPU temp value, etc.) should we expect? Any suggestions on the stress test program we are using (first post)?

Test at the same point the TDG specifies. The CPU and GPU temperatures are expected to stay below the “Recommended Tegra X2 operating temperature limit”.

I was finally able to run the tests again and got similar results.

Test Description:

P2597 evm board
Two new TX2 modules without Heatsink, no fan, fw Jetpack 4.3
Running stress test from first post
Thermometer: Graphtec GL840, T-type Thermocouple, ± 0.6 ºC measurement accuracy
Thermocouple location - TTP temperature: Jetson_TX2_Thermal_Design_Guide_v1.0 - Figure 3-2
DUT inside chamber adjusting ambient temperature to different points: -20C, 10C, 15C, 18C
TX2 logs retrieved with tegrastats command

Test Results for one TX2 module: (Data captured 4 hours after Chamber temperature is stable)

Ambient temperature = -20C
TX2 TTP temperature = 29.5C
RAM 2524/7860MB (lfb 979x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@47C MCPU@47C PMIC@100C Tboard@27C GPU@51.5C BCPU@47C thermal@48.8C Tdiode@51.5C VDD_SYS_GPU 10571/10358 VDD_SYS_SOC 1072/1063 VDD_4V0_WIFI 0/9 VDD_IN 19060/19033 VDD_SYS_CPU 4443/4577 VDD_SYS_DDR 1798/1811

Result: Pass - No CPU throttling

Ambient temperature = 10C
TX2 TTP temperature = 67.0C
RAM 2405/7860MB (lfb 1169x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@87C MCPU@87C PMIC@100C Tboard@63C GPU@92.5C BCPU@87C thermal@89.5C Tdiode@92C VDD_SYS_GPU 11915/11912 VDD_SYS_SOC 1451/1377 VDD_4V0_WIFI 0/15 VDD_IN 22111/21877 VDD_SYS_CPU 5576/5495 VDD_SYS_DDR 1990/1872

Result: Pass - No CPU throttling

Ambient temperature = 15C
TX2 TTP temperature = 68.5C
RAM 3971/7860MB (lfb 780x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@90.5C MCPU@90.5C PMIC@100C Tboard@65C GPU@95C BCPU@90.5C thermal@92.3C Tdiode@94.5C VDD_SYS_GPU 11457/11041 VDD_SYS_SOC 1451/1448 VDD_4V0_WIFI 0/10 VDD_IN 21461/21183 VDD_SYS_CPU 5499/5613 VDD_SYS_DDR 1856/1874

Result: Pass - No CPU throttling

Ambient temperature = 18C
TX2 TTP temperature = 69.1C
02/02/20 06:13:01: RAM 3771/7860MB (lfb 822x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@91C MCPU@91C PMIC@100C Tboard@67C GPU@95.5C BCPU@91C thermal@93.1C Tdiode@95C VDD_SYS_GPU 10392/10346 VDD_SYS_SOC 1451/1449 VDD_4V0_WIFI 0/10 VDD_IN 20734/20252 VDD_SYS_CPU 5805/5368 VDD_SYS_DDR 1856/1870
02/02/20 06:13:02: RAM 3236/7860MB (lfb 955x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@91C MCPU@91C PMIC@100C Tboard@67C GPU@95.5C BCPU@91C thermal@92.8C Tdiode@95C VDD_SYS_GPU 10392/10346 VDD_SYS_SOC 1451/1449 VDD_4V0_WIFI 0/10 VDD_IN 20696/20252 VDD_SYS_CPU 5728/5368 VDD_SYS_DDR 1875/1870
02/02/20 06:13:03: RAM 4705/7860MB (lfb 588x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@91C MCPU@91C PMIC@100C Tboard@67C GPU@95.5C BCPU@91C thermal@93.1C Tdiode@94.75C VDD_SYS_GPU 10392/10346 VDD_SYS_SOC 1452/1449 VDD_4V0_WIFI 0/10 VDD_IN 20581/20252 VDD_SYS_CPU 5728/5369 VDD_SYS_DDR 1837/1870
02/02/20 06:13:04: RAM 2128/7860MB (lfb 1232x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@91C MCPU@91C PMIC@100C Tboard@67C GPU@95.5C BCPU@91C thermal@93.1C Tdiode@95C VDD_SYS_GPU 10392/10346 VDD_SYS_SOC 1451/1449 VDD_4V0_WIFI 0/10 VDD_IN 20734/20252 VDD_SYS_CPU 5805/5369 VDD_SYS_DDR 1894/1870
02/02/20 06:13:05: RAM 3414/7860MB (lfb 911x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@1881,100%@1881,100%@1881,100%@1881,100%@1881,100%@1881] EMC_FREQ 0% GR3D_FREQ 99% PLL@91C MCPU@91C PMIC@100C Tboard@67C GPU@95.5C BCPU@91C thermal@92.8C Tdiode@94.75C VDD_SYS_GPU 10396/10346 VDD_SYS_SOC 1453/1449 VDD_4V0_WIFI 0/10 VDD_IN 19786/20252 VDD_SYS_CPU 4892/5369 VDD_SYS_DDR 1837/1870
02/02/20 06:13:06: RAM 3665/7860MB (lfb 848x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@1881,100%@1881,100%@1881,100%@1881,100%@1881,100%@1881] EMC_FREQ 0% GR3D_FREQ 99% PLL@90.5C MCPU@90.5C PMIC@100C Tboard@67C GPU@95.5C BCPU@90.5C thermal@92.8C Tdiode@94.75C VDD_SYS_GPU 10396/10346 VDD_SYS_SOC 1453/1449 VDD_4V0_WIFI 0/10 VDD_IN 19556/20252 VDD_SYS_CPU 4816/5368 VDD_SYS_DDR 1779/1870
02/02/20 06:13:07: RAM 1869/7860MB (lfb 1272x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@1881,100%@1881,100%@1881,100%@1881,100%@1881,100%@1881] EMC_FREQ 0% GR3D_FREQ 99% PLL@90.5C MCPU@90.5C PMIC@100C Tboard@67C GPU@95C BCPU@90.5C thermal@92.5C Tdiode@94.75C VDD_SYS_GPU 10396/10346 VDD_SYS_SOC 1376/1449 VDD_4V0_WIFI 0/10 VDD_IN 19488/20251 VDD_SYS_CPU 4739/5368 VDD_SYS_DDR 1760/1870
02/02/20 06:13:08: RAM 2203/7860MB (lfb 1213x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@1881,100%@1881,100%@1881,100%@1881,100%@1881,100%@1881] EMC_FREQ 0% GR3D_FREQ 99% PLL@90.5C MCPU@90.5C PMIC@100C Tboard@67C GPU@95.5C BCPU@90.5C thermal@92.3C Tdiode@94.5C VDD_SYS_GPU 10396/10346 VDD_SYS_SOC 1453/1449 VDD_4V0_WIFI 0/10 VDD_IN 19633/20251 VDD_SYS_CPU 4892/5368 VDD_SYS_DDR 1779/1870
02/02/20 06:13:09: RAM 2899/7860MB (lfb 1039x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@1881,100%@1881,100%@1881,100%@1881,100%@1881,100%@1881] EMC_FREQ 0% GR3D_FREQ 99% PLL@90.5C MCPU@90.5C PMIC@100C Tboard@67C GPU@95.5C BCPU@90.5C thermal@92.5C Tdiode@94.5C VDD_SYS_GPU 10396/10346 VDD_SYS_SOC 1453/1449 VDD_4V0_WIFI 0/10 VDD_IN 19595/20251 VDD_SYS_CPU 4739/5368 VDD_SYS_DDR 1817/1870
02/02/20 06:13:10: RAM 3764/7860MB (lfb 823x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@1881,100%@1881,100%@1881,100%@1881,100%@1881,100%@1881] EMC_FREQ 0% GR3D_FREQ 99% PLL@90C MCPU@90C PMIC@100C Tboard@67C GPU@95.5C BCPU@90C thermal@92.5C Tdiode@94.5C VDD_SYS_GPU 10396/10346 VDD_SYS_SOC 1453/1449 VDD_4V0_WIFI 0/10 VDD_IN 19518/20251 VDD_SYS_CPU 4663/5368 VDD_SYS_DDR 1856/1870
02/02/20 06:13:11: RAM 2474/7860MB (lfb 975x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@90.5C MCPU@90.5C PMIC@100C Tboard@67C GPU@95C BCPU@90.5C thermal@92.5C Tdiode@94.25C VDD_SYS_GPU 10396/10346 VDD_SYS_SOC 1453/1449 VDD_4V0_WIFI 0/10 VDD_IN 19710/20251 VDD_SYS_CPU 4892/5368 VDD_SYS_DDR 1837/1870
02/02/20 06:13:12: RAM 1342/7860MB (lfb 1360x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@1881,100%@1881,100%@1881,100%@1881,100%@1881,100%@1881] EMC_FREQ 0% GR3D_FREQ 99% PLL@90.5C MCPU@90.5C PMIC@100C Tboard@67C GPU@95C BCPU@90.5C thermal@92.5C Tdiode@94.25C VDD_SYS_GPU 10396/10346 VDD_SYS_SOC 1453/1449 VDD_4V0_WIFI 0/10 VDD_IN 19556/20251 VDD_SYS_CPU 4739/5368 VDD_SYS_DDR 1837/1870
02/02/20 06:13:13: RAM 2830/7860MB (lfb 1056x4MB) SWAP 0/3930MB (cached 0MB) CPU [100%@2035,100%@2035,100%@2035,100%@2035,100%@2035,100%@2035] EMC_FREQ 0% GR3D_FREQ 99% PLL@91C MCPU@91C PMIC@100C Tboard@67C GPU@95.5C BCPU@91C thermal@92.6C Tdiode@94.5C VDD_SYS_GPU 10392/10346 VDD_SYS_SOC 1452/1449 VDD_4V0_WIFI 0/10 VDD_IN 20322/20251 VDD_SYS_CPU 5425/5368 VDD_SYS_DDR 1856/1870

Result: Failed - CPU throttling, CPU frequency swings between 100%@2035 and 100%@1881
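
The swing between 2035 and 1881 MHz is easy to miss in a long capture. A minimal sketch for flagging it, assuming the tegrastats output was saved to a file: the function name `find_throttled` is hypothetical, and the 2035 MHz default is simply the full clock seen in the logs in this thread with `nvpmodel -m 0` and `jetson_clocks`. A lower clock is treated as a sign of throttling only because the clocks are otherwise pinned in that setup.

```shell
# Print the timestamp (or line number) and lowest CPU clock of every tegrastats
# sample in which any core runs below the expected maximum clock.
# Usage: find_throttled <logfile> [max_mhz]
find_throttled() {
    awk -v max="${2:-2035}" '
    {
        start = index($0, "CPU [")
        if (start == 0) next
        rest = substr($0, start + 5)
        cpu = substr(rest, 1, index(rest, "]") - 1)   # text between the brackets
        n = split(cpu, cores, ",")
        low = max + 0
        for (i = 1; i <= n; i++) {
            # Each entry looks like "100%@1881"; offline cores print "off".
            if (split(cores[i], parts, "@") == 2 && parts[2] + 0 < low)
                low = parts[2] + 0
        }
        if (low < max + 0) {
            # Use the leading "MM/DD/YY HH:MM:SS:" stamp when the line has one.
            if ($1 ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/)
                stamp = $1 " " $2
            else
                stamp = "line " NR ":"
            print stamp, low " MHz"
        }
    }' "$1"
}
```

Run against the 18C excerpt above, this would list the samples from 06:13:05 through 06:13:10 and the one at 06:13:12, each at 1881 MHz.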

Comments and Questions:

At a TTP temperature of around 70C we already get CPU throttling.
Is the CPU throttling event logged somewhere? I don’t see anything in dmesg.
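
Whether L4T records throttling events in a log is not established in this thread. One thing that can be polled during the test is the generic Linux thermal sysfs interface, which exposes per-zone temperatures, trip points, and cooling device states. The sketch below is an assumption-laden example: the function name `dump_thermal` is hypothetical, the zone and cooling device names differ between L4T releases, and it is not confirmed here that every throttling mechanism on the TX2 shows up as a cooling device.

```shell
# List each thermal zone with its current temperature and trip points, then
# each cooling device with its current state (non-zero usually means active).
# The optional argument overrides the sysfs root, mainly for testing.
dump_thermal() {
    local base=${1:-/sys/class/thermal} zone trip dev
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        # sysfs reports temperatures in millidegrees Celsius (mC)
        printf '%s: %s mC\n' "$(cat "$zone/type")" "$(cat "$zone/temp")"
        for trip in "$zone"/trip_point_*_temp; do
            [ -r "$trip" ] || continue
            printf '  trip %s: %s mC\n' "$(cat "${trip%_temp}_type")" "$(cat "$trip")"
        done
    done
    for dev in "$base"/cooling_device*; do
        [ -r "$dev/cur_state" ] || continue
        printf 'cooling %s: state %s of %s\n' \
            "$(cat "$dev/type")" "$(cat "$dev/cur_state")" "$(cat "$dev/max_state")"
    done
}
```

Calling `dump_thermal` once per second next to tegrastats and comparing the trip points with the temperatures at which the CPU clock drops would show whether a trip point sits near the observed 91C to 95C readings.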

Hi dbnvai, thanks for the detailed test steps and results, we will check it. Furthermore, can you please share a photo of the test equipment setup?

I still can’t figure out how to share pictures here. I press the image icon and it shows this