Illegal instruction (error 715) when profiling vLLM with NCU

Hi all,

I’m hitting an issue when profiling models in vLLM with NCU: I get an illegal instruction (error 715) as soon as I profile the model. Running without profiling works just fine. So the question is: what is changing under NCU profiling? Why do I not get an illegal instruction error when running without profiling?

Here’s what I’m doing:

  • I’m building vLLM (Python-build only) from source, since I’m making slight modifications to the repo to restrict the profiling range and to use only a single hidden layer for the models I’m profiling.
  • I’m profiling OPT 1.3B with a single hidden layer as follows:
ncu --target-processes all \
    --profile-from-start off \
    --set full \
    --replay-mode application \
    -f -o output.ncu-rep \
    python3 main.py
  • I’m running on an H100 with CUDA 12.5, driver 555.42.06, ncu version 2024.2.1.0, and Ubuntu 22.04
  • The error that I get is below
==PROF== Connected to process 3686 (/usr/bin/python3.10)
==WARNING== Unable to access the following 6 metrics: ctc__rx_bytes_data_user.sum, ctc__rx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__rx_bytes_data_user.sum.per_second, ctc__tx_bytes_data_user.sum, ctc__tx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__tx_bytes_data_user.sum.per_second.

==PROF== Profiling "unrolled_elementwise_kernel" - 0: Application replay pass 1
==PROF== Profiling "prepare_varlen_num_blocks_ker..." - 1: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 2: Application replay pass 1
==PROF== Profiling "triton_red_fused__to_copy_add..." - 3: Application replay pass 1
==PROF== Profiling "nvjet_hsh_192x128_64x5_1x2_h_..." - 4: Application replay pass 1
==PROF== Profiling "reshape_and_cache_flash_kernel" - 5: Application replay pass 1
==PROF== Profiling "device_kernel" - 6: Application replay pass 1
==PROF== Profiling "vectorized_elementwise_kernel" - 7: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x128_64x6_1x2_h_..." - 8: Application replay pass 1
==PROF== Profiling "triton_red_fused_add_addmm_na..." - 9: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x256_64x4_2x1_v_..." - 10: Application replay pass 1
==PROF== Profiling "triton_poi_fused_addmm_relu_1" - 11: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x128_64x6_1x2_h_..." - 12: Application replay pass 1
==PROF== Profiling "triton_red_fused_add_addmm_na..." - 13: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 14: Application replay pass 1
==PROF== Profiling "index_elementwise_kernel" - 15: Application replay pass 1
==PROF== Profiling "nvjet_hsh_384x8_64x4_2x1_v_bz..." - 16: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 17: Application replay pass 1
==PROF== Profiling "index_elementwise_kernel" - 18: Application replay pass 1
==PROF== Profiling "elementwise_kernel" - 19: Application replay pass 1
==PROF== Profiling "cunn_SoftMaxForward" - 20: Application replay pass 1
==PROF== Profiling "distribution_elementwise_grid..." - 21: Application replay pass 1
==PROF== Profiling "vectorized_elementwise_kernel" - 22: Application replay pass 1
==PROF== Profiling "reduce_kernel" - 23: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 24: Application replay pass 1
Step 0 took 2181.89990234375 ms
================= Inside execute_model, PID is 3837 times_called is 1, PROFS: False, False, True
==PROF== Profiling "unrolled_elementwise_kernel" - 25: Application replay pass 1
==PROF== Profiling "prepare_varlen_num_blocks_ker..." - 26: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 27: Application replay pass 1
==PROF== Profiling "triton_red_fused__to_copy_add..." - 28: Application replay pass 1
==PROF== Profiling "nvjet_hsh_64x8_64x16_4x1_v_bz..." - 29: Application replay pass 1
TMA Desc Addr:   0x7ffe214b7640
format         6
dim            4
gmem_address   0x79f09fa02000
globalDim      (64,1,32,1,1)
globalStrides  (2,12288,128,0,0)
boxDim         (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr:   0x7ffe214b7640
format         6
dim            4
gmem_address   0x79f0aa000000
globalDim      (64,16,32,625914,1)
globalStrides  (2,4096,128,65536,0)
boxDim         (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr:   0x7ffe214b7640
format         6
dim            4
gmem_address   0x79fa36fa0000
globalDim      (64,16,32,625914,1)
globalStrides  (2,4096,128,65536,0)
boxDim         (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr:   0x7ffe214b7640
format         6
dim            4
gmem_address   0x79f0aa000000
globalDim      (64,16,32,625914,1)
globalStrides  (2,4096,128,65536,0)
boxDim         (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr:   0x7ffe214b7640
format         6
dim            4
gmem_address   0x79fa36fa0000
globalDim      (64,16,32,625914,1)
globalStrides  (2,4096,128,65536,0)
boxDim         (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        3
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 715
CUDA error (/workspace/.deps/vllm-flash-attn-src/hopper/flash_fwd_launch_template.h:189): an illegal instruction was encountered
==PROF== Disconnected from process 3837
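For what it’s worth, the dumped descriptor fields can be cross-checked against the limits documented for cuTensorMapEncodeTiled. A small sketch (the limits and the format-6 = FLOAT16 mapping are my reading of the CUDA driver API docs, so treat them as assumptions rather than an authoritative validator):

```python
# Constraints as I read them from the cuTensorMapEncodeTiled docs (assumed, not verified).
ELEM_BYTES = {6: 2}                           # format 6 = CU_TENSOR_MAP_DATA_TYPE_FLOAT16
SWIZZLE_BYTES = {0: 0, 1: 32, 2: 64, 3: 128}  # NONE / 32B / 64B / 128B swizzle spans

def check_tma(fmt, gmem_address, global_dim, box_dim, elem_strides, swizzle):
    """Return a list of violated constraints; empty list means all checks pass."""
    elem = ELEM_BYTES[fmt]
    problems = []
    if gmem_address % 16:
        problems.append("globalAddress not 16-byte aligned")
    if any(d > 2**32 for d in global_dim):
        problems.append("globalDim exceeds 2^32")
    if any(not (1 <= b <= 256) for b in box_dim):
        problems.append("boxDim outside [1, 256]")
    if any(not (1 <= s <= 8) for s in elem_strides):
        problems.append("elementStrides outside [1, 8]")
    sw = SWIZZLE_BYTES[swizzle]
    if sw and box_dim[0] * elem > sw:
        problems.append("inner box exceeds swizzle span")
    return problems

# First dumped descriptor from the log above:
print(check_tma(6, 0x79F09FA02000, (64, 1, 32, 1, 1),
                (64, 192, 1, 1, 1), (1, 1, 1, 1, 1), 3))  # -> []
```

At least under these assumed limits, the dumped descriptors pass every check, which fits the fact that the same kernels run fine without the profiler.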

To check whether a more recent CUDA version would still hit the same issue, I briefly spun up two VMs in the cloud with the following configurations:

  • H100 with CUDA 12.8.1, driver 570.124.06, ncu version 2025.1.0, Ubuntu 22.04: running the model works fine, but it again fails with the exact same error as above as soon as I add NCU profiling
  • H100 with CUDA 12.9, driver 575.51.03, ncu version 2025.2.0.0, Ubuntu 22.04: NCU profiling now works fine without any errors. I have no idea what changed, however; I didn’t see anything noticeable in the changelog.

For cost reasons I cannot run my measurements in the cloud; I have to run them on an on-premise cluster where the H100 GPU is currently configured with CUDA 12.5, driver 555.42.06, and ncu 2024.2.1.0 (the configuration of the stack trace above). Unfortunately, upgrading to CUDA 12.9 is not an option in the short term, as it is a shared cluster and I am only a user of it.

I was hoping to understand what is causing the error to happen under NCU profiling while running without profiling works just fine. Any help is much appreciated!
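For reference, the range restriction mentioned above boils down to wrapping the region of interest in cudaProfilerStart/cudaProfilerStop, which is what --profile-from-start off waits for (in PyTorch, torch.cuda.profiler.start()/stop() do the same). A minimal ctypes sketch that degrades to a no-op when libcudart is not available:

```python
import ctypes
from contextlib import contextmanager

def _cudart():
    # Try a few common sonames; return None when CUDA isn't available.
    for name in ("libcudart.so", "libcudart.so.12", "libcudart.so.11.0"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None

@contextmanager
def profiler_range():
    """Wrap the region that `ncu --profile-from-start off` should capture."""
    rt = _cudart()
    if rt is not None:
        rt.cudaProfilerStart()
    try:
        yield
    finally:
        if rt is not None:
            rt.cudaProfilerStop()

# Usage: only work issued inside the range is profiled, e.g.
# with profiler_range():
#     model(inputs)
```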

Thanks a lot


Hi, I am also trying to profile vLLM with NCU and encountered the same error. Did you resolve it?

May I ask which PyTorch and Python versions you used with CUDA 12.9? It seems that PyTorch does not support CUDA 12.9 yet; I tried CUDA 12.9 with the cu128 PyTorch wheel but still got the same error.

Hi, @elpaul and @terryxhx

Sorry for the issue you’re hitting.
Could you please try fewer metrics to narrow down the issue, instead of using --set full?
For example, use --metrics sm__cycles_active.sum and see if ncu passes.

If that works, then please try one section at a time using --section <section name>. The section names can be obtained with --list-sections.
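As a sketch, the per-section sweep could be scripted like this (the section identifiers below are examples only; take the authoritative list from ncu --list-sections):

```python
import shutil
import subprocess

# Example section identifiers; get the full list from `ncu --list-sections`.
SECTIONS = ["LaunchStats", "Occupancy", "SpeedOfLight"]

def ncu_section_cmd(section, app=("python3", "main.py")):
    # One section per invocation keeps the failing metric group easy to pin down.
    return ["ncu", "--target-processes", "all",
            "--profile-from-start", "off",
            "--section", section,
            "-f", "-o", f"out_{section}", *app]

if __name__ == "__main__" and shutil.which("ncu"):
    for sec in SECTIONS:
        subprocess.run(ncu_section_cmd(sec), check=False)
```

Running one section per invocation makes it easy to see which metric group triggers the failure.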

Thanks!

Upgrading to CUDA 12.9 is not an option in the short term unfortunately as it is a shared cluster and I am only a user of it.

Please use ncu 2025.2.1 with the driver and CTK combination that otherwise works for you. This ncu is compatible with other CUDA 12.x drivers and toolkits; you don’t have to pick the one from the CUDA toolkit installation. You can download it separately from the Nsight Compute page on NVIDIA Developer.

Thanks for your reply!

When I profiled with only the metric you suggested, I got a LaunchFailed error:

Thanks for your reply!

I installed ncu 2025.2.1 separately and tried it with CUDA 12.8:
ubuntu@192-222-55-22:~/sept-latentwave/GPU-PMC-Verifier$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2025 NVIDIA Corporation
Version 2025.2.1.0 (build 35987062) (public-release)
ubuntu@192-222-55-22:~/sept-latentwave/GPU-PMC-Verifier$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

but received the same error as above.