vLLM image: nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3
1. GPT-OSS 120B FP4, GPT-OSS 20B FP4, and Qwen3-32B-FP4 were all tested and all failed.
Reason: vLLM 0.9.2 does not support FP4 quantization or the GPT-OSS models.
2. Running Qwen3-32B-INT4 and Qwen3-32B-INT8 fails with: no kernel image is available for execution on the device
3. Qwen3-32B-BF16: 3.7 tokens/s with a single request.
4. Qwen3-32B-FP8: measured output token throughput stays flat from a single request up to 5-20 concurrent requests; our preliminary diagnosis is a memory-bandwidth bottleneck.
Memory bandwidth measured with a torch script (see the sketch after this list): read 117 GB/s, write 71 GB/s.
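For reference, here is a minimal sketch of such a bandwidth check in PyTorch (this is not the exact script used for the 117/71 GB/s numbers above; the buffer size, iteration count, and the ~64 GB BF16 weight-size figure are illustrative assumptions):

# Minimal sketch: effective GPU memory bandwidth via device-to-device copies,
# plus a rough decode-throughput roofline. Not the exact script referenced above.
import torch

def measure_copy_bandwidth(size_gib: float = 4.0, iters: int = 20) -> float:
    """Time repeated device-to-device copies and return GB/s of memory traffic."""
    n = int(size_gib * (1 << 30))                 # bytes per buffer
    src = torch.empty(n, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dst.copy_(src, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000.0    # elapsed_time() returns milliseconds
    moved = 2 * n * iters                         # each copy reads src and writes dst
    return moved / seconds / 1e9

if __name__ == "__main__":
    bw = measure_copy_bandwidth()
    print(f"effective copy bandwidth: {bw:.1f} GB/s")
    # Rough roofline: each decoded token reads all weights once, so
    # tokens/s ~= bandwidth / weight bytes (KV cache and activations ignored).
    weight_gb = 64.0                              # ~32B params x 2 bytes (BF16), assumed
    print(f"roofline decode estimate for a ~{weight_gb:.0f} GB model: {bw / weight_gb:.1f} tokens/s")

Under this roofline view, the 273 GB/s spec over ~64 GB of BF16 weights would put Qwen3-32B near 4 tokens/s, in the same ballpark as the 3.7 tokens/s measured above, which is why the much lower 117/71 GB/s readings looked suspicious.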
Hi,
You will need our latest vLLM container for GPT-OSS support.
Please find more details in the comment below:
Thanks.
Hi,
In our testing, the bandwidth is around 23x GB/s, which is close to the 273 GB/s spec:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA Thor, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0
0 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0
0 234.25
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0
0 232.63
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0
0 235.24
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0
0 239.80
P2P=Disabled Latency Matrix (us)
GPU 0
0 3.46
CPU 0
0 2.74
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0
0 2.91
CPU 0
0 2.43
Thanks.
That is still somewhat higher than the numbers we measured. Is there any configuration or other prerequisite that needs to be adjusted?
You could see if more recent versions of vLLM might help. (Edit: changed to the correct image.)
nvcr.io/nvidia/tritonserver:25.09-vllm-python-py3
vllm --version 0.10.1.1+381074ae
or
nvcr.io/nvidia/tritonserver:25.10-vllm-python-py3
vllm --version 0.10.2+9dd9ca32
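As a quick sanity check inside one of these newer containers, something like the following minimal sketch could confirm that Qwen3-32B loads and generates before running throughput tests (the Hugging Face model ID, the quantization="fp8" setting, and max_model_len are assumptions, not values confirmed in this thread):

# Minimal sketch: offline vLLM smoke test for Qwen3-32B with FP8 quantization.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-32B", quantization="fp8", max_model_len=4096)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)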
Hi,
We maximize the device performance first:
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
Thanks.