Hi all,
I’m facing an issue when profiling models in vLLM with Nsight Compute (ncu): as soon as I profile the model, I hit an illegal-instruction error (CUDA error 715). Running the same workload without profiling works just fine. So the question is: what changes under NCU profiling, and why does the illegal instruction only show up then?
Here’s what I’m doing:
- I’m building vLLM (Python-build only) from source, since I’m making slight modifications to the repo to restrict the profiling range and to use only a single hidden layer for the models I’m profiling (a simplified sketch of this setup follows the error log below).
- I’m profiling OPT-1.3B with a single hidden layer as follows:
ncu --target-processes all \
    --profile-from-start off \
    --set full \
    --replay-mode application \
    -f -o output.ncu-rep \
    python3 main.py
- I’m running on an H100 with CUDA 12.5, driver 555.42.06, and ncu version 2024.2.1.0, on Ubuntu 22.04.
- The error I get is below:
==PROF== Connected to process 3686 (/usr/bin/python3.10)
==WARNING== Unable to access the following 6 metrics: ctc__rx_bytes_data_user.sum, ctc__rx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__rx_bytes_data_user.sum.per_second, ctc__tx_bytes_data_user.sum, ctc__tx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__tx_bytes_data_user.sum.per_second.
==PROF== Profiling "unrolled_elementwise_kernel" - 0: Application replay pass 1
==PROF== Profiling "prepare_varlen_num_blocks_ker..." - 1: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 2: Application replay pass 1
==PROF== Profiling "triton_red_fused__to_copy_add..." - 3: Application replay pass 1
==PROF== Profiling "nvjet_hsh_192x128_64x5_1x2_h_..." - 4: Application replay pass 1
==PROF== Profiling "reshape_and_cache_flash_kernel" - 5: Application replay pass 1
==PROF== Profiling "device_kernel" - 6: Application replay pass 1
==PROF== Profiling "vectorized_elementwise_kernel" - 7: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x128_64x6_1x2_h_..." - 8: Application replay pass 1
==PROF== Profiling "triton_red_fused_add_addmm_na..." - 9: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x256_64x4_2x1_v_..." - 10: Application replay pass 1
==PROF== Profiling "triton_poi_fused_addmm_relu_1" - 11: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x128_64x6_1x2_h_..." - 12: Application replay pass 1
==PROF== Profiling "triton_red_fused_add_addmm_na..." - 13: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 14: Application replay pass 1
==PROF== Profiling "index_elementwise_kernel" - 15: Application replay pass 1
==PROF== Profiling "nvjet_hsh_384x8_64x4_2x1_v_bz..." - 16: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 17: Application replay pass 1
==PROF== Profiling "index_elementwise_kernel" - 18: Application replay pass 1
==PROF== Profiling "elementwise_kernel" - 19: Application replay pass 1
==PROF== Profiling "cunn_SoftMaxForward" - 20: Application replay pass 1
==PROF== Profiling "distribution_elementwise_grid..." - 21: Application replay pass 1
==PROF== Profiling "vectorized_elementwise_kernel" - 22: Application replay pass 1
==PROF== Profiling "reduce_kernel" - 23: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 24: Application replay pass 1
Step 0 took 2181.89990234375 ms
================= Inside execute_model, PID is 3837 times_called is 1, PROFS: False, False, True
==PROF== Profiling "unrolled_elementwise_kernel" - 25: Application replay pass 1
==PROF== Profiling "prepare_varlen_num_blocks_ker..." - 26: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 27: Application replay pass 1
==PROF== Profiling "triton_red_fused__to_copy_add..." - 28: Application replay pass 1
==PROF== Profiling "nvjet_hsh_64x8_64x16_4x1_v_bz..." - 29: Application replay pass 1
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79f09fa02000
globalDim (64,1,32,1,1)
globalStrides (2,12288,128,0,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79f0aa000000
globalDim (64,16,32,625914,1)
globalStrides (2,4096,128,65536,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79fa36fa0000
globalDim (64,16,32,625914,1)
globalStrides (2,4096,128,65536,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79f0aa000000
globalDim (64,16,32,625914,1)
globalStrides (2,4096,128,65536,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79fa36fa0000
globalDim (64,16,32,625914,1)
globalStrides (2,4096,128,65536,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
CUDA error (/workspace/.deps/vllm-flash-attn-src/hopper/flash_fwd_launch_template.h:189): an illegal instruction was encountered
==PROF== Disconnected from process 3837
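As a side note, the 715 in the descriptor dump is simply the numeric value of the generic CUDA illegal-instruction error, matching the message the flash-attn check prints at the end. A quick way to confirm the mapping, assuming the cuda-python package is installed:

# Confirm that CUDA error code 715 is the "illegal instruction" error
# (requires cuda-python: pip install cuda-python)
from cuda import cudart

err = cudart.cudaError_t(715)
print(err)  # cudaError_t.cudaErrorIllegalInstruction
_, msg = cudart.cudaGetErrorString(err)
print(msg)  # b'an illegal instruction was encountered'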
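For context, here is roughly what my main.py does. This is a simplified sketch, not my exact code: the prompt and max_tokens are illustrative, and in my actual setup the single-layer restriction and the profiler start/stop calls are patched into the vLLM repo itself rather than living in the script.

# Simplified, illustrative sketch of main.py (not my exact code)
import time

import torch
from vllm import LLM, SamplingParams

# In my setup the repo is patched so the model runs with a single
# hidden layer (num_hidden_layers == 1); shown here only as the idea.
llm = LLM(model="facebook/opt-1.3b")
params = SamplingParams(max_tokens=16)

# Warm-up outside the profiled range; with --profile-from-start off,
# ncu only starts collecting once cudaProfilerStart() is reached.
llm.generate(["Hello, my name is"], params)

torch.cuda.profiler.start()  # cudaProfilerStart(): profiling range begins
t0 = time.time()
llm.generate(["Hello, my name is"], params)
print(f"Step 0 took {(time.time() - t0) * 1000} ms")
torch.cuda.profiler.stop()  # cudaProfilerStop(): profiling range ends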
To check whether a more recent CUDA version would still face the same issue, I briefly spun up two cloud VMs with the following configurations:
- H100 with CUDA 12.8.1, driver 570.124.06, ncu version 2025.1.0, Ubuntu 22.04: running the model works fine, but it again fails as soon as I add NCU profiling, with exactly the same error as above.
- H100 with CUDA 12.9, driver 575.51.03, ncu version 2025.2.0.0, Ubuntu 22.04: NCU profiling now works fine without any errors. I have no idea what changed, though; I didn’t spot anything noticeable in the changelog.
For cost reasons I cannot run my measurements in the cloud; I have to run them on an on-premise cluster where the H100 GPU is currently configured with CUDA 12.5, driver 555.42.06, and ncu 2024.2.1.0 (the configuration from the stack trace above). Unfortunately, upgrading to CUDA 12.9 is not an option in the short term, as it is a shared cluster and I am only a user of it.
I was hoping to understand what causes the error under NCU profiling when running without profiling works just fine. Any help is much appreciated!
Thanks a lot
