Summary
On Jetson AGX Thor, after running a large vLLM workload and stopping the container, the system can enter a state where nvfancontrol and several thermal-related kernel workers are stuck in uninterruptible sleep (D state). The board then continues heating, normal shutdown may not complete, and only a forced power cycle/cold reset recovers it.
I originally saw this after using the Jetson Thor vLLM container with a Qwen3.6 27B INT4 model and interacting with it from another machine through an OpenAI-compatible HTTP API. I later reproduced the same thermal/BPMP stall with a standalone host-side CUDA cudaMallocManaged stress test, without Docker, PyTorch, or vLLM.
Hardware / OS
- Device: Jetson AGX Thor
- Architecture: aarch64
- OS: Ubuntu 24.04.4 LTS
- Kernel:
6.8.12-tegra - L4T:
R38.4.0 - JetPack:
7.1-b112 nvidia-l4t-core:38.4.0-20251230160601- CUDA compiler:
nvcc 13.0.48 - Memory: about 122 GiB
- Docker:
29.1.3-0ubuntu3~24.04.2 - containerd:
2.2.1-0ubuntu1~24.04.2 - runc:
1.3.4-0ubuntu1~24.04.1 - NVIDIA Container Toolkit:
1.18.1-1
vLLM Container
Image:
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
RepoDigest: ghcr.io/nvidia-ai-iot/vllm@sha256:b587dd56b4cb076209ad5156a626ac75f5a976d0e8e7d1e6a9fccd56d1bd65e8
Image ID: sha256:11544a7267571a837e2abc4a14be638257d7f402b0fc45d2223eec0f5f3e8c09
Created: 2026-04-06T20:36:27Z
Verified inside the container earlier:
torch.cuda.is_available() = True
GPU name = NVIDIA Thor
vLLM = 0.19.0+cu130
Transformers = 4.57.3
Model
Model repo: Lorbus/Qwen3.6-27B-int4-AutoRound
Local path: models/qwen3.6-27b-int4-autoround
Served name: qwen3.6-27b-int4
Base model: Qwen/Qwen3.6-27B
Quantization: INT4 W4A16 AutoRound
MTP head: preserved
Small local compatibility edit:
tokenizer_config.json:
"tokenizer_class": "TokenizersBackend"
changed to:
"tokenizer_class": "Qwen2TokenizerFast"
vLLM Launch Configuration
Important vLLM args:
--trust-remote-code
--tensor-parallel-size 1
--max-model-len 262144
--gpu-memory-utilization 0.58
--kv-cache-dtype fp8
--max-num-seqs 1
--max-num-batched-tokens 32768
--enable-chunked-prefill
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--override-generation-config '{"max_new_tokens":32768}'
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
Docker-related args:
--name thor-vllm
--init
--restart no
--privileged
-v /dev:/dev
-v /usr/lib/aarch64-linux-gnu/nvidia:/host-nvidia-libs:ro
-v /opt/nvidia-libcompute-580/usr/lib/aarch64-linux-gnu:/opt/nvidia-libcompute-580/usr/lib/aarch64-linux-gnu:ro
-e LD_LIBRARY_PATH=/host-nvidia-libs:/opt/nvidia-libcompute-580/usr/lib/aarch64-linux-gnu:/usr/lib/aarch64-linux-gnu/nvidia:/usr/local/cuda-13.0/targets/sbsa-linux/lib:/usr/local/cuda/targets/sbsa-linux/lib
-e LD_PRELOAD=/host-nvidia-libs/libcuda.so.1
--ipc host
--ulimit memlock=-1:-1
--ulimit stack=67108864
-p 8100:8000
The explicit mounts are used because on this system nvidia-container-runtime did not provide a Jetson CSV mount spec, and the container otherwise missed host CUDA driver libraries.
Original Reproduction Path With vLLM
- Cold boot / clean boot.
- Confirm no existing
Dstate thermal processes usingps, nottegrastats. - Start the vLLM container.
- Wait for Qwen3.6 27B INT4 to load and expose the OpenAI-compatible HTTP server on port
8100. - From another machine, run a Hermes agent against:
http://<thor-ip>:8100/v1/chat/completions
- Complete one HTTP interaction round. I did not capture the exact request payload; it was a normal Hermes agent OpenAI-compatible chat/completions interaction from another machine.
- Stop the service:
docker stop --time 120 thor-vllm
docker rm thor-vllm
- After this, the system can enter a bad state:
nvfancontrolstuck inD- several
events_freezable_power_kworkers stuck inD tegrastats, if run afterwards, hangs in the thermal read path- old
vllm/VLLM::EngineCoreprocesses may remain as zombies - the board heats up
- normal shutdown may hang
- only force power-off / cold reset recovers it
Observed Blocked Stacks
nvfancontrol:
thermal_zone_get_temp
temp_show
dev_attr_show
sysfs_kf_seq_show
kernfs_seq_show
seq_read_iter
kernfs_fop_read_iter
vfs_read
ksys_read
Thermal kworkers:
tegra_bpmp_transfer
__thermal_zone_get_temp
__thermal_zone_device_update
thermal_zone_device_check
process_one_work
worker_thread
kthread
Example ps state after reproducing:
D kworker/u39:0+events_freezable_power_
D kworker/u29:2+events_freezable_power_
D kworker/u37:1+events_freezable_power_
Ds nvfancontrol /usr/sbin/nvfancontrol
D kworker/u29:5+events_freezable_power_
This looks like a BPMP/thermal query path stall: Linux thermal readers block while asking BPMP for temperature / thermal data.
Isolation Tests
I then tried to separate CPU/RAM pressure from CUDA unified-memory pressure.
CPU/RAM stress: did not reproduce
Host-side stress-ng, no CUDA, no Docker, no vLLM:
MemAvailable at start: ~117.2 GiB
Stress target: 75% of MemAvailable, ~87.9 GiB
Duration: 300 seconds
Workers: 14 cpu, 7 matrix, 7 memcpy, 1 vm
Result: 29 workers passed, 0 failed
No nvfancontrol D state
No thermal kworker D state
CUDA managed memory stress: reproduced
Host-side CUDA only, no Docker, no PyTorch, no vLLM:
MemAvailable at start: ~114.7 GiB
cudaMallocManaged target: 75% of MemAvailable, ~86.0 GiB
Device: NVIDIA Thor
Workload: GPU kernel repeatedly sweeps the managed-memory allocation
Result:
completed_sweeps: 91
runtime before detection: about 90 seconds
nvfancontrol entered D state
multiple thermal kworkers entered D state
Stacks again showed:
tegra_bpmp_transfer -> __thermal_zone_get_temp -> thermal_zone_device_check
and:
thermal_zone_get_temp -> temp_show -> sysfs read
So plain CPU/RAM pressure did not reproduce the issue, but large host-side cudaMallocManaged pressure did reproduce the same BPMP thermal stall without vLLM or Docker.
Expected Behavior
Large CUDA unified-memory workloads or vLLM workloads should either complete, fail cleanly, or be killable. Thermal polling should not permanently block nvfancontrol / thermal kworkers in D state.
Actual Behavior
Thermal/BPMP communication appears to stop returning. Any userspace or kernel worker reading thermal zones can become stuck in uninterruptible sleep. The system then cannot reliably control / observe thermals, may heat up, and normal shutdown may not work.
Questions
- Is this a known Jetson AGX Thor r38.4 BPMP / thermal firmware issue under large CUDA unified-memory pressure?
- Is there a newer BPMP firmware / JetPack / L4T build that addresses this?
- Are there recommended limits for
cudaMallocManaged/ unified-memory allocation size on Thor to avoid this? - What logs would NVIDIA want before rebooting? I avoided
tegrastatsafter the issue appears because it can also block inthermal_zone_get_temp. - Is there a safer way to recover or reset the BPMP thermal path without cold power cycling?
- Is there an NVIDIA-recommended, known-good complete vLLM reference implementation or launch configuration for Jetson AGX Thor that I should compare against?