vLLM tests failing on the Thor developer kit

vLLM image: nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3
1. GPT-OSS 120B FP4, GPT-OSS 20B FP4, and Qwen3-32B-FP4 all failed to run.
Cause: vLLM 0.9.2 supports neither FP4 quantization nor the GPT-OSS models.
2. Qwen3-32B-INT4 and Qwen3-32B-INT8 both fail with the error: no kernel image is available for execution on the device
3. Qwen3-32B-BF16: 3.7 tokens/s with a single stream.
4. Qwen3-32B-FP8: output-token throughput stays flat from a single stream up to 5-20 concurrent requests; our initial diagnosis is a memory-bandwidth bottleneck.
A torch script measured memory bandwidth at 117 GB/s read and 71 GB/s write (a sketch of such a script follows this list).
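For scale: a 32B model carries roughly 64 GB of weights in BF16, and decode streams essentially all of them once per token, so the 3.7 tokens/s in item 3 corresponds to about 64 GB × 3.7 ≈ 237 GB/s of effective bandwidth, consistent with a bandwidth-bound decode. Below is a minimal sketch of the kind of torch probe used for the numbers above; the 4 GiB buffer, the iteration counts, and the arch-list check (relevant to the error in item 2) are our assumptions, not the original script.

import torch

def time_op(fn, iters=50):
    # Warm up, then time on-GPU with CUDA events.
    for _ in range(5):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0 / iters  # seconds per call

dev = torch.device("cuda")
# Relevant to the "no kernel image" error in item 2: the device's compute
# capability vs. the SM architectures this torch build ships kernels for.
print("device capability:", torch.cuda.get_device_capability(dev))
print("compiled arch list:", torch.cuda.get_arch_list())

n = 4 * 1024**3  # 4 GiB working set (illustrative size)
x = torch.empty(n, dtype=torch.uint8, device=dev)

t = time_op(lambda: x.fill_(1))      # write-only: touches every byte once
print(f"write: {n / t / 1e9:.1f} GB/s")

t = time_op(lambda: torch.sum(x))    # read-dominated: reduces every byte
print(f"read:  {n / t / 1e9:.1f} GB/s")

If the device is sitting in a low-power mode, a probe like this will also read low; see the nvpmodel/jetson_clocks note at the end of this thread.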

Hi,

You will need our latest vLLM container for GPT-OSS support.
Please find more details in the comment below:

Thanks.

Our latest p2pBandwidthLatencyTest run on the T5000 shows 178 GB/s unidirectional write and 225 GB/s unidirectional read.

These numbers still look a bit low. Could you check whether anything else needs adjusting? Thanks!

Also, on a 4090D we measured 852 GB/s, somewhat below its official 1008.0 GB/s memory-bandwidth spec.
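(For reference, p2pBandwidthLatencyTest is the sample from NVIDIA's cuda-samples GitHub repository; assuming the current repo layout, a typical build-and-run looks like the following. The exact path and build system vary by release: older releases build per sample with make, newer ones with CMake from the repo root.)

$ git clone https://github.com/NVIDIA/cuda-samples.git
$ cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
$ make
$ ./p2pBandwidthLatencyTest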

Hi,

In our testing, bandwidth is around 23x GB/s (230-240 GB/s), which is close to the 273 GB/s spec:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA Thor, pciBusID: 1, pciDeviceID: 0, pciDomainID:0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0
     0	     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0 
     0 234.25 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0 
     0 232.63 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0 
     0 235.24 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0 
     0 239.80 
P2P=Disabled Latency Matrix (us)
   GPU     0 
     0   3.46 

   CPU     0 
     0   2.74 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0 
     0   2.91 

   CPU     0 
     0   2.43 

Thanks.

That is still somewhat higher than what we measured. Is there any configuration or other prerequisite we should adjust?

You could see if a more recent version of vLLM helps. (Edit: changed to the correct image.)

nvcr.io/nvidia/tritonserver:25.09-vllm-python-py3

$ vllm --version
0.10.1.1+381074ae

or

nvcr.io/nvidia/tritonserver:25.10-vllm-python-py3 

$ vllm --version
0.10.2+9dd9ca32
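One way to try an image and check the vLLM version inside it (assuming the standard Jetson Docker setup with the NVIDIA container runtime):

$ sudo docker run --rm -it --runtime nvidia --network host \
    nvcr.io/nvidia/tritonserver:25.10-vllm-python-py3 \
    vllm --version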

Hi,

We maximize the device performance first:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
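
To verify the settings took effect:

$ sudo nvpmodel -q
$ sudo jetson_clocks --show

nvpmodel -q prints the active power mode, and jetson_clocks --show prints the current clock state.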

Thanks.