Jetson AGX Thor r38.4: BPMP thermal path stalls after large CUDA/vLLM unified-memory workload; nvfancontrol and thermal kworkers stuck in D state

Summary

On Jetson AGX Thor, after running a large vLLM workload and stopping the container, the system can enter a state where nvfancontrol and several thermal-related kernel workers are stuck in uninterruptible sleep (D state). The board then continues heating, normal shutdown may not complete, and only a forced power cycle/cold reset recovers it.

I originally saw this after using the Jetson Thor vLLM container with a Qwen3.6 27B INT4 model and interacting with it from another machine through an OpenAI-compatible HTTP API. I later reproduced the same thermal/BPMP stall with a standalone host-side CUDA cudaMallocManaged stress test, without Docker, PyTorch, or vLLM.

Hardware / OS

  • Device: Jetson AGX Thor
  • Architecture: aarch64
  • OS: Ubuntu 24.04.4 LTS
  • Kernel: 6.8.12-tegra
  • L4T: R38.4.0
  • JetPack: 7.1-b112
  • nvidia-l4t-core: 38.4.0-20251230160601
  • CUDA compiler: nvcc 13.0.48
  • Memory: about 122 GiB
  • Docker: 29.1.3-0ubuntu3~24.04.2
  • containerd: 2.2.1-0ubuntu1~24.04.2
  • runc: 1.3.4-0ubuntu1~24.04.1
  • NVIDIA Container Toolkit: 1.18.1-1

vLLM Container

Image:

ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
RepoDigest: ghcr.io/nvidia-ai-iot/vllm@sha256:b587dd56b4cb076209ad5156a626ac75f5a976d0e8e7d1e6a9fccd56d1bd65e8
Image ID: sha256:11544a7267571a837e2abc4a14be638257d7f402b0fc45d2223eec0f5f3e8c09
Created: 2026-04-06T20:36:27Z

Verified inside the container earlier:

torch.cuda.is_available() = True
GPU name = NVIDIA Thor
vLLM = 0.19.0+cu130
Transformers = 4.57.3

Model

Model repo: Lorbus/Qwen3.6-27B-int4-AutoRound
Local path: models/qwen3.6-27b-int4-autoround
Served name: qwen3.6-27b-int4
Base model: Qwen/Qwen3.6-27B
Quantization: INT4 W4A16 AutoRound
MTP head: preserved

Small local compatibility edit:

tokenizer_config.json:
"tokenizer_class": "TokenizersBackend"
changed to:
"tokenizer_class": "Qwen2TokenizerFast"

vLLM Launch Configuration

Important vLLM args:

--trust-remote-code
--tensor-parallel-size 1
--max-model-len 262144
--gpu-memory-utilization 0.58
--kv-cache-dtype fp8
--max-num-seqs 1
--max-num-batched-tokens 32768
--enable-chunked-prefill
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--override-generation-config '{"max_new_tokens":32768}'
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Docker-related args:

--name thor-vllm
--init
--restart no
--privileged
-v /dev:/dev
-v /usr/lib/aarch64-linux-gnu/nvidia:/host-nvidia-libs:ro
-v /opt/nvidia-libcompute-580/usr/lib/aarch64-linux-gnu:/opt/nvidia-libcompute-580/usr/lib/aarch64-linux-gnu:ro
-e LD_LIBRARY_PATH=/host-nvidia-libs:/opt/nvidia-libcompute-580/usr/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu/nvidia:/usr/local/cuda-13.0/targets/sbsa-linux/lib:/usr/local/cuda/targets/sbsa-linux/lib
-e LD_PRELOAD=/host-nvidia-libs/libcuda.so.1
--ipc host
--ulimit memlock=-1:-1
--ulimit stack=67108864
-p 8100:8000

The explicit mounts are needed because nvidia-container-runtime did not provide a Jetson CSV mount spec on this system, so the container would otherwise be missing the host CUDA driver libraries.

Original Reproduction Path With vLLM

  1. Cold boot / clean boot.
  2. Confirm with ps (not tegrastats) that no thermal-related processes are already in D state.
  3. Start the vLLM container.
  4. Wait for Qwen3.6 27B INT4 to load and expose the OpenAI-compatible HTTP server on port 8100.
  5. From another machine, run a Hermes agent against:
http://<thor-ip>:8100/v1/chat/completions
  6. Complete one HTTP interaction round. I did not capture the exact request payload; it was a normal Hermes agent OpenAI-compatible chat/completions interaction from another machine.
  7. Stop the service:
docker stop --time 120 thor-vllm
docker rm thor-vllm
  8. After this, the system can enter a bad state:
    • nvfancontrol stuck in D
    • several events_freezable_power_ kworkers stuck in D
    • tegrastats, if run afterwards, hangs in the thermal read path
    • old vllm / VLLM::EngineCore processes may remain as zombies
    • the board heats up
    • normal shutdown may hang
    • only force power-off / cold reset recovers it
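
The pre-check in step 2 and the first two symptoms in step 8 can be checked the same way. Below is a minimal Python sketch, not the script I actually used: the helper names are hypothetical, and matching on the strings `nvfancontrol` and `events_freezable_power_` is an assumption based on the task names seen in this report. It only runs `ps`, so it does not touch thermal sysfs.

```python
import subprocess

# Task-name hints taken from the symptoms above; matching on them is an assumption.
THERMAL_TASK_HINTS = ("nvfancontrol", "events_freezable_power_")


def find_stuck_thermal_tasks(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, args) for D-state tasks whose command line matches a hint.

    Expects headerless lines of the form "<state> <pid> <args>".
    """
    stuck = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        state, pid, args = parts
        if state.startswith("D") and any(h in args for h in THERMAL_TASK_HINTS):
            stuck.append((int(pid), args))
    return stuck


def snapshot() -> str:
    """Capture process states with ps only (no thermal sysfs reads, no tegrastats)."""
    # Separate -o options with empty headers suppress the header line.
    result = subprocess.run(
        ["ps", "-e", "-o", "state=", "-o", "pid=", "-o", "args="],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling `find_stuck_thermal_tasks(snapshot())` before starting the container should return an empty list on a clean boot; a non-empty list after `docker stop` corresponds to the bad state described above.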

Observed Blocked Stacks

nvfancontrol:

thermal_zone_get_temp
temp_show
dev_attr_show
sysfs_kf_seq_show
kernfs_seq_show
seq_read_iter
kernfs_fop_read_iter
vfs_read
ksys_read

Thermal kworkers:

tegra_bpmp_transfer
__thermal_zone_get_temp
__thermal_zone_device_update
thermal_zone_device_check
process_one_work
worker_thread
kthread

Example ps state after reproducing:

D  kworker/u39:0+events_freezable_power_
D  kworker/u29:2+events_freezable_power_
D  kworker/u37:1+events_freezable_power_
Ds nvfancontrol /usr/sbin/nvfancontrol
D  kworker/u29:5+events_freezable_power_

This looks like a BPMP/thermal query path stall: Linux thermal readers block while asking BPMP for temperature / thermal data.
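
To sort captured stacks consistently, a small classifier helps. This is a hedged Python sketch with hypothetical names and labels. It assumes one bare function name per line, as in the listings above, and it separates stacks that contain `tegra_bpmp_transfer` from stacks that only show the thermal read path (like the nvfancontrol listing above, which stops at `thermal_zone_get_temp`).

```python
# Frame names copied from the blocked stacks above.
BPMP_FRAME = "tegra_bpmp_transfer"
THERMAL_FRAMES = {"thermal_zone_get_temp", "__thermal_zone_get_temp"}


def classify_stack(stack_text: str) -> str:
    """Label a stack given as one bare function name per line.

    The three labels are illustrative, not kernel terminology.
    """
    frames = [line.strip() for line in stack_text.splitlines() if line.strip()]
    if not any(frame in THERMAL_FRAMES for frame in frames):
        return "unrelated"
    if BPMP_FRAME in frames:
        return "blocked_in_bpmp_transfer"
    return "blocked_in_thermal_read"
```

With the two listings above, the thermal kworker stack is labeled `blocked_in_bpmp_transfer` and the nvfancontrol stack `blocked_in_thermal_read`. Whether the second kind is simply queued behind the first is my guess, not something the stacks prove.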

Isolation Tests

I then tried to separate CPU/RAM pressure from CUDA unified-memory pressure.

CPU/RAM stress: did not reproduce

Host-side stress-ng, no CUDA, no Docker, no vLLM:

MemAvailable at start: ~117.2 GiB
Stress target: 75% of MemAvailable, ~87.9 GiB
Duration: 300 seconds
Workers: 14 cpu, 7 matrix, 7 memcpy, 1 vm
Result: 29 workers passed, 0 failed
No nvfancontrol D state
No thermal kworker D state

CUDA managed memory stress: reproduced

Host-side CUDA only, no Docker, no PyTorch, no vLLM:

MemAvailable at start: ~114.7 GiB
cudaMallocManaged target: 75% of MemAvailable, ~86.0 GiB
Device: NVIDIA Thor
Workload: GPU kernel repeatedly sweeps the managed-memory allocation

Result:

completed_sweeps: 91
runtime before detection: about 90 seconds
nvfancontrol entered D state
multiple thermal kworkers entered D state

Stacks again showed:

tegra_bpmp_transfer -> __thermal_zone_get_temp -> thermal_zone_device_check

and:

thermal_zone_get_temp -> temp_show -> sysfs read

So plain CPU/RAM pressure did not reproduce the issue, but large host-side cudaMallocManaged pressure did reproduce the same BPMP thermal stall without vLLM or Docker.
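
For reference, the "75% of MemAvailable" sizing used in both tests can be computed as below. This is only a sketch in Python with hypothetical helper names; it assumes the usual `MemAvailable:  <n> kB` line in `/proc/meminfo` and integer rounding down to a whole kB, which may differ from what any particular stress script does.

```python
import re


def mem_available_kb(meminfo_text: str) -> int:
    """Extract the MemAvailable value (in kB) from /proc/meminfo text."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB\s*$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable line not found")
    return int(match.group(1))


def managed_target_bytes(meminfo_text: str, percent: int = 75) -> int:
    """Allocation target in bytes: percent of MemAvailable, rounded down to a whole kB."""
    return (mem_available_kb(meminfo_text) * percent // 100) * 1024


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB for display."""
    return num_bytes / 2**30
```

On the board, `managed_target_bytes(open("/proc/meminfo").read())` gives the byte count to pass to `cudaMallocManaged`. Lowering `percent` is an easy way to look for a threshold below which the stall no longer appears.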

Expected Behavior

Large CUDA unified-memory workloads or vLLM workloads should either complete, fail cleanly, or be killable. Thermal polling should not permanently block nvfancontrol / thermal kworkers in D state.

Actual Behavior

Thermal/BPMP communication appears to stop returning. Any userspace or kernel worker reading thermal zones can become stuck in uninterruptible sleep. The system then cannot reliably control / observe thermals, may heat up, and normal shutdown may not work.

Questions

  1. Is this a known Jetson AGX Thor r38.4 BPMP / thermal firmware issue under large CUDA unified-memory pressure?
  2. Is there a newer BPMP firmware / JetPack / L4T build that addresses this?
  3. Are there recommended limits for cudaMallocManaged / unified-memory allocation size on Thor to avoid this?
  4. What logs would NVIDIA want before rebooting? I avoided tegrastats after the issue appears because it can also block in thermal_zone_get_temp.
  5. Is there a safer way to recover or reset the BPMP thermal path without cold power cycling?
  6. Is there an NVIDIA-recommended, known-good complete vLLM reference implementation or launch configuration for Jetson AGX Thor that I should compare against?

Hi,

We are not aware of an issue related to the managed memory on Thor.
However, there is a similar issue (tegra_bpmp_transfer gets stuck) on Orin.

Could you try the fix shared above to see if it helps?
If the issue still occurs, could you share a local CUDA-managed memory stress app with us so we can try it locally?

Thanks.

Hi NVIDIA team,

Thank you for the suggestion and for pointing me to the Orin `host1x-fence` lock-race issue.

I checked the Thor R38.4 source and tested the suggested direction locally. The Thor source does contain a similar-looking `host1x-fence` pattern around `host1x_pollfd_poll()` / `host1x_pollfd_release()`, so I built a minimal test module that changed the relevant local `spin_lock()` / `spin_unlock()` pair to `spin_lock_irqsave()` / `spin_unlock_irqrestore()`.

Unfortunately, that did **not** fix the issue on this Jetson AGX Thor system. With the test module loaded, the same failure still reproduced, and the blocked stacks still pointed at the BPMP/thermal path, for example:

```text
tegra_bpmp_transfer
__thermal_zone_get_temp
thermal_zone_device_check
```

So the Orin fix may be valid for that Orin issue, but it does not appear to be sufficient for this Thor R38.4 failure mode.

I have uploaded a self-contained CUDA managed-memory stress reproducer. On my Thor board, this reproducer currently triggers the issue every time I run it with the default settings.

To run it:

```bash
tar -xzf up_jetson_thor_repro_20260519.tar.gz
cd up
chmod +x run_repro.sh
sudo ./run_repro.sh
```

If `nvcc` is not in `PATH`:

```bash
sudo REPRO_NVCC=/usr/local/cuda-13.0/bin/nvcc ./run_repro.sh
```

What the reproducer does:

- runs directly on the host

- does not use Docker

- does not use PyTorch

- does not use vLLM

- allocates about 75% of current `MemAvailable` with `cudaMallocManaged`

- touches the managed buffer from the CPU

- repeatedly sweeps the same managed buffer from a GPU kernel

- avoids `tegrastats`, `jtop`, and thermal-zone sysfs reads

- logs `ps`, warning-level kernel messages, and candidate `/proc//stack` output

The script automatically checks for the target D-state condition. When reproduced, it writes files such as:

```text
logs//observe-*-stall-candidates.txt
logs//observe-*-candidate-stacks.txt
```

My local verification run is included in the uploaded folder:

```text
logs/20260519-122043/
```

That run captured `nvfancontrol` and several `events_freezable_power_` kworkers in `D` state. Representative stacks included:

```text
tegra_bpmp_transfer
__thermal_zone_get_temp
thermal_zone_get_temp
temp_show
```

and:

```text
tegra_bpmp_transfer
__thermal_zone_get_temp
__thermal_zone_device_update
thermal_zone_device_check
```

The original issue was first observed after running a large vLLM workload on Jetson AGX Thor. At first I suspected Docker, PyTorch, or vLLM, but this standalone CUDA managed-memory test reproduces the same BPMP/thermal D-state failure without any of those layers. My current assumption is that vLLM was only a convenient high-pressure CUDA workload that exposed the underlying Thor BPMP/thermal/CUDA-UVM interaction issue.

If the standalone reproducer does not trigger the problem on your Thor board, please also try a large vLLM workload with high GPU memory pressure. In my case, the original vLLM workload could leave the system in the same state:

```text
nvfancontrol stuck in D state
thermal-related kworkers stuck in D state
thermal reads / tegrastats can hang
normal shutdown may not complete
```

System details from my board:

```text
Device: Jetson AGX Thor
Architecture: aarch64
Kernel: 6.8.12-tegra
L4T: R38.4.0
JetPack: 7.1-b112
nvidia-l4t-core: 38.4.0-20251230160601
nvidia-l4t-kernel: 6.8.12-tegra-38.4.0-20251230160601
CUDA: 13.0
```

up_jetson_thor_repro_20260519.tar.gz (469.8 KB)

Please let me know what additional logs would be useful before the required cold reboot after reproduction.

Hi,

Below is what we got when running the app:

$ sudo ./run_repro.sh

[sudo] password for nvidia: 
== Jetson AGX Thor CUDA managed-memory BPMP thermal stall repro ==
Log directory: /home/nvidia/topic_370477/up/logs/20260520-070829
Safe observation: ps/dmesg/proc stack only; no thermal sysfs reads, no tegrastats, no jtop.
CUDA pressure: cudaMallocManaged at 75% of current MemAvailable for 300s.
MemAvailable at start: 24.8 GiB
cudaMallocManaged target: 18.6 GiB (19999595520 bytes)
Compiling repro with /usr/local/cuda-13.0/bin/nvcc
CUDA repro pid=44536
CUDA log: /home/nvidia/topic_370477/up/logs/20260520-070829/cuda-managed-repro.log

Detected BPMP/thermal stall candidate. Stopping CUDA repro.
Result: BPMP_THERMAL_STALL_CANDIDATE_DETECTED

So this means we were able to reproduce it locally, right?
Which logs/commands should we check to see the D state?

Thanks.

Hi,

Yes, that output means the reproducer hit the target condition locally.

Please check these files in the generated log directory:

logs/20260520-070829/observe-*-stall-candidates.txt
logs/20260520-070829/observe-*-candidate-stacks.txt
logs/20260520-070829/run.log
logs/20260520-070829/system-info.txt

The D-state processes should be listed in:

observe-*-stall-candidates.txt

The kernel stacks should be in:

observe-*-candidate-stacks.txt

You can also check manually with:

ps -eo state,pid,comm,wchan:40,args | awk '$1 ~ /D/ {print}'

For each D-state PID:

sudo cat /proc/<pid>/stack
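
If it is easier to compare stacks as plain function names, the raw `/proc/<pid>/stack` text can be reduced like this. It is a small Python sketch with hypothetical helper names, assuming the usual `[<0>] func+0xoff/0xsize` line format. Reading `/proc/<pid>/stack` needs root and does not touch thermal sysfs.

```python
import re
from pathlib import Path

# Matches lines such as "[<0>] worker_thread+0x20c/0x440" and captures "worker_thread".
_FRAME = re.compile(r"^\[<[0-9a-fx]+>\]\s+(?P<func>[^\s+]+)")


def parse_kernel_stack(raw: str) -> list[str]:
    """Reduce raw /proc/<pid>/stack text to a list of function names, top frame first."""
    names = []
    for line in raw.splitlines():
        match = _FRAME.match(line.strip())
        if match:
            names.append(match.group("func"))
    return names


def read_kernel_stack(pid: int) -> list[str]:
    """Read and parse the kernel stack of one task (requires root)."""
    return parse_kernel_stack(Path(f"/proc/{pid}/stack").read_text())
```

A stack from a stuck task should then start with, or at least contain, the names listed below.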

The target signature is nvfancontrol or events_freezable_power_ workers stuck in D state, with stacks containing paths like:

tegra_bpmp_transfer
__thermal_zone_get_temp
thermal_zone_device_check

Thanks.

Hi

I hope you are doing well.

I just wanted to kindly follow up on this issue. I am not sure whether you have already seen my previous reply and whether you were able to find the D-state processes and the corresponding kernel stacks in the generated logs.

I would also like to ask whether this behavior may indicate a hardware issue on my Jetson AGX Thor. If so, would replacing the hardware potentially resolve the problem?

Alternatively, is this a real/common issue that can also be reproduced on your side, or is it more likely caused by a mistake in my setup? If it is a real issue, may I ask whether the team is considering a fix? This problem almost prevents me from deploying large language models with frameworks such as vLLM on this device. If it is caused by something I did incorrectly, I would be very grateful if you could give me some guidance on where the issue might be.

Thank you very much for your time and effort. I really appreciate your help. I wish you good health and smooth work.

Thanks.

Hi,

Sorry for the late update.

We found the following message in the logs:
observe-065654-stall-candidates.txt

 162250       2 R       10:06:29 tegra_bpmp_transfer                      kworker/u28:1+d [kworker/u28:1+devfreq_wq]

observe-065654-candidate-stacks.txt

== pid 162250 stack ==
[<0>] worker_thread+0x20c/0x440
[<0>] kthread+0x110/0x124
[<0>] ret_from_fork+0x10/0x20

Is this the D-state issue you reported previously?
The temperature of the device did not go up after reproducing the issue.
We can also change the nvpmodel normally.

Thanks.

up/logs/20260519-122043$ cat observe-122326-stall-candidates.txt
   1615       1 Ds         52:41 tegra_bpmp_transfer                      nvfancontrol    /usr/sbin/nvfancontrol &
   2098       2 D          52:39 tegra_bpmp_transfer                      kworker/u32:3+e [kworker/u32:3+events_freezable_power_]
   6725       2 D          34:48 tegra_bpmp_transfer                      kworker/u35:2+e [kworker/u35:2+events_freezable_power_]
   6769       2 D          27:04 tegra_bpmp_transfer                      kworker/u30:0+e [kworker/u30:0+events_freezable_power_]
   6891       2 D          06:42 thermal_zone_device_check                kworker/u35:0+e [kworker/u35:0+events_freezable_power_]
   7825       2 D          00:05 tegra_bpmp_transfer                      kworker/u35:1+e [kworker/u35:1+events_freezable_power_]

Hi NVIDIA team,

Thank you for checking.

This does not seem to be the same failure I am reporting. In your log, the task was in R state and it was a devfreq_wq worker:

162250       2 R       10:06:29 tegra_bpmp_transfer                      kworker/u28:1+d [kworker/u28:1+devfreq_wq]

When I run the same reproducer on my Jetson AGX Thor, I see D / Ds state tasks, especially nvfancontrol and events_freezable_power_ workers.
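
To make the difference concrete, here is a hedged Python sketch that checks one line of `observe-*-stall-candidates.txt` against the signature I am reporting. The column order (pid, ppid, stat, elapsed time, wchan, comm, args) is inferred from the log lines in this thread, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Task-name hints for the reported failure; matching on them is an assumption.
SIGNATURE_HINTS = ("nvfancontrol", "events_freezable_power_")


@dataclass(frozen=True)
class Candidate:
    pid: int
    ppid: int
    stat: str
    etime: str
    wchan: str
    comm: str
    args: str


def parse_candidate(line: str) -> Optional[Candidate]:
    """Parse one whitespace-separated candidate line; return None if it does not fit."""
    fields = line.split(None, 6)
    if len(fields) < 7 or not (fields[0].isdigit() and fields[1].isdigit()):
        return None
    pid, ppid, stat, etime, wchan, comm, args = fields
    return Candidate(int(pid), int(ppid), stat, etime, wchan, comm, args.rstrip())


def matches_reported_signature(line: str) -> bool:
    """True only for D-state tasks that look like nvfancontrol or a thermal kworker."""
    candidate = parse_candidate(line)
    if candidate is None:
        return False
    in_d_state = candidate.stat.startswith("D")
    return in_d_state and any(hint in candidate.args for hint in SIGNATURE_HINTS)
```

The `devfreq_wq` line from your log does not match (state R, different workqueue), while the `nvfancontrol` and `events_freezable_power_` lines from my board do.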

Here is the exact sequence I just ran again on my board:

cd /home/thor/projects/thor-vllm-ngc-2601/up
sudo ./run_repro.sh

The script output was:

== Jetson AGX Thor CUDA managed-memory BPMP thermal stall repro ==
Log directory: /home/thor/projects/thor-vllm-ngc-2601/up/logs/20260525-230019
Safe observation: ps/dmesg/proc stack only; no thermal sysfs reads, no tegrastats, no jtop.
CUDA pressure: cudaMallocManaged at 75% of current MemAvailable for 300s.
MemAvailable at start: 116.0 GiB
cudaMallocManaged target: 87.0 GiB (93433967616 bytes)
Compiling repro with /usr/local/cuda-13.0/bin/nvcc
CUDA repro pid=15083
CUDA log: /home/thor/projects/thor-vllm-ngc-2601/up/logs/20260525-230019/cuda-managed-repro.log
Detected BPMP/thermal stall candidate. Stopping CUDA repro.
Result: BPMP_THERMAL_STALL_CANDIDATE_DETECTED

After that, I checked the D-state tasks with:

ps -eo state,pid,comm,wchan:40,args | awk '$1 ~ /D/ {print}'

The output was:

D    1616 nvfancontrol    -                                        /usr/sbin/nvfancontrol &
D    2982 kworker/u31:3+e -                                        [kworker/u31:3+events_freezable_power_]
D   12453 kworker/u32:0+e -                                        [kworker/u32:0+events_freezable_power_]
D   14533 kworker/u31:0+e -                                        [kworker/u31:0+events_freezable_power_]
D   14546 kworker/u29:2+e -                                        [kworker/u29:2+events_freezable_power_]
D   14572 kworker/u32:2+e -                                        [kworker/u32:2+events_freezable_power_]

The script also recorded this in:

logs/20260525-230019/observe-230312-stall-candidates.txt

The contents were:

   1616       1 Ds         29:11 tegra_bpmp_transfer                      nvfancontrol    /usr/sbin/nvfancontrol &
   2982       2 D          26:44 tegra_bpmp_transfer                      kworker/u31:3+e [kworker/u31:3+events_freezable_power_]
  12453       2 D          23:43 thermal_zone_device_check                kworker/u32:0+e [kworker/u32:0+events_freezable_power_]
  14533       2 D          07:55 tegra_bpmp_transfer                      kworker/u31:0+e [kworker/u31:0+events_freezable_power_]
  14546       2 D          07:15 tegra_bpmp_transfer                      kworker/u29:2+e [kworker/u29:2+events_freezable_power_]
  14572       2 D          06:08 tegra_bpmp_transfer                      kworker/u32:2+e [kworker/u32:2+events_freezable_power_]

The corresponding stack file was:

logs/20260525-230019/observe-230312-candidate-stacks.txt

Representative nvfancontrol stack:

tegra_bpmp_transfer
__thermal_zone_get_temp
thermal_zone_get_temp
temp_show

Representative thermal worker stack:

tegra_bpmp_transfer
__thermal_zone_get_temp
__thermal_zone_device_update
thermal_zone_device_check

Could you please try the reproducer one more time and check specifically for D / Ds state tasks like nvfancontrol or events_freezable_power_?

If you still cannot reproduce this same D-state thermal/BPMP stall on your side, should I consider this a possible hardware issue with my Jetson AGX Thor board?

Thank you very much.

up.zip (1.0 MB)

Yes, thank you :)

Hi NVIDIA team,

I would like to give an update on this issue.

After doing a clean system reinstall on my Jetson AGX Thor, the problem has temporarily disappeared on my side.

I re-ran the same high-pressure CUDA managed-memory reproducer with the default 75% MemAvailable / 300s settings. This time the workload completed successfully. During the run I only saw a few short transient devfreq_wq / tegra_bpmp_transfer observations, but they cleared by themselves. I did not see the previous failure signature:

  • no nvfancontrol stuck in D state
  • no events_freezable_power_ thermal kworkers stuck in D state
  • no persistent BPMP/thermal stall
  • after the run, the system state was clean and nvfancontrol was still running normally
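
For anyone repeating this, the distinction between a transient observation and a persistent stall can be made explicit. The following is a minimal Python sketch, assuming periodic snapshots of the set of D-state PIDs; the function name and the threshold of three consecutive snapshots are my own assumptions, not something the reproducer defines.

```python
def persistent_pids(snapshots: list[set[int]], min_consecutive: int = 3) -> set[int]:
    """Return PIDs seen in at least `min_consecutive` consecutive snapshots.

    Each snapshot is the set of PIDs observed in D state at one sampling time.
    A PID that disappears between samples has its streak reset to zero.
    """
    streaks: dict[int, int] = {}
    persistent: set[int] = set()
    for snapshot in snapshots:
        # Rebuilding the dict drops PIDs missing from this snapshot, resetting them.
        streaks = {pid: streaks.get(pid, 0) + 1 for pid in snapshot}
        persistent |= {pid for pid, count in streaks.items() if count >= min_consecutive}
    return persistent
```

With a sampling interval of a few seconds, tasks that clear by themselves never reach the threshold, while the earlier failure (tasks stuck for many minutes) would be flagged quickly.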

So for now, the issue appears to be resolved after reinstalling the system. It may have been related to my previous local installation/configuration state rather than a consistently reproducible platform issue.

Thank you very much for your time, patience, and help in checking this. I really appreciate it.

If the issue appears again in the future, I will update this thread with fresh logs.

Hi,

Thanks a lot for the update.
Good to know it can work now.