Hi all,
I’m facing an issue when profiling models in vLLM with Nsight Compute (ncu): as soon as I profile the model, I hit an illegal-instruction error (CUDA error 715). Running the same workload without profiling works just fine. So the question is: what changes under NCU profiling, and why does the illegal instruction only show up then?
Here’s what I’m doing:
- I’m building vLLM (Python-build only) from source, since I’m making slight modifications to the repo to restrict the profiling range and to use only a single hidden layer for the models I’m profiling (a simplified sketch of this setup follows the error log below).
- I’m profiling OPT-1.3B with a single hidden layer as follows:
ncu --target-processes all \
    --profile-from-start off \
    --set full \
    --replay-mode application \
    -f -o output.ncu-rep \
    python3 main.py
- I’m running on an H100 with CUDA 12.5, driver 555.42.06, and ncu version 2024.2.1.0, on Ubuntu 22.04.
- The error I get is below:
==PROF== Connected to process 3686 (/usr/bin/python3.10)
==WARNING== Unable to access the following 6 metrics: ctc__rx_bytes_data_user.sum, ctc__rx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__rx_bytes_data_user.sum.per_second, ctc__tx_bytes_data_user.sum, ctc__tx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__tx_bytes_data_user.sum.per_second.
==PROF== Profiling "unrolled_elementwise_kernel" - 0: Application replay pass 1
==PROF== Profiling "prepare_varlen_num_blocks_ker..." - 1: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 2: Application replay pass 1
==PROF== Profiling "triton_red_fused__to_copy_add..." - 3: Application replay pass 1
==PROF== Profiling "nvjet_hsh_192x128_64x5_1x2_h_..." - 4: Application replay pass 1
==PROF== Profiling "reshape_and_cache_flash_kernel" - 5: Application replay pass 1
==PROF== Profiling "device_kernel" - 6: Application replay pass 1
==PROF== Profiling "vectorized_elementwise_kernel" - 7: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x128_64x6_1x2_h_..." - 8: Application replay pass 1
==PROF== Profiling "triton_red_fused_add_addmm_na..." - 9: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x256_64x4_2x1_v_..." - 10: Application replay pass 1
==PROF== Profiling "triton_poi_fused_addmm_relu_1" - 11: Application replay pass 1
==PROF== Profiling "nvjet_hsh_128x128_64x6_1x2_h_..." - 12: Application replay pass 1
==PROF== Profiling "triton_red_fused_add_addmm_na..." - 13: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 14: Application replay pass 1
==PROF== Profiling "index_elementwise_kernel" - 15: Application replay pass 1
==PROF== Profiling "nvjet_hsh_384x8_64x4_2x1_v_bz..." - 16: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 17: Application replay pass 1
==PROF== Profiling "index_elementwise_kernel" - 18: Application replay pass 1
==PROF== Profiling "elementwise_kernel" - 19: Application replay pass 1
==PROF== Profiling "cunn_SoftMaxForward" - 20: Application replay pass 1
==PROF== Profiling "distribution_elementwise_grid..." - 21: Application replay pass 1
==PROF== Profiling "vectorized_elementwise_kernel" - 22: Application replay pass 1
==PROF== Profiling "reduce_kernel" - 23: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 24: Application replay pass 1
Step 0 took 2181.89990234375 ms
================= Inside execute_model, PID is 3837 times_called is 1, PROFS: False, False, True
==PROF== Profiling "unrolled_elementwise_kernel" - 25: Application replay pass 1
==PROF== Profiling "prepare_varlen_num_blocks_ker..." - 26: Application replay pass 1
==PROF== Profiling "unrolled_elementwise_kernel" - 27: Application replay pass 1
==PROF== Profiling "triton_red_fused__to_copy_add..." - 28: Application replay pass 1
==PROF== Profiling "nvjet_hsh_64x8_64x16_4x1_v_bz..." - 29: Application replay pass 1
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79f09fa02000
globalDim (64,1,32,1,1)
globalStrides (2,12288,128,0,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79f0aa000000
globalDim (64,16,32,625914,1)
globalStrides (2,4096,128,65536,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79fa36fa0000
globalDim (64,16,32,625914,1)
globalStrides (2,4096,128,65536,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79f0aa000000
globalDim (64,16,32,625914,1)
globalStrides (2,4096,128,65536,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
TMA Desc Addr: 0x7ffe214b7640
format 6
dim 4
gmem_address 0x79fa36fa0000
globalDim (64,16,32,625914,1)
globalStrides (2,4096,128,65536,0)
boxDim (64,192,1,1,1)
elementStrides (1,1,1,1,1)
interleave 0
swizzle 3
l2Promotion 2
oobFill 0
Error: Failed to initialize the TMA descriptor 715
CUDA error (/workspace/.deps/vllm-flash-attn-src/hopper/flash_fwd_launch_template.h:189): an illegal instruction was encountered
==PROF== Disconnected from process 3837
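As a side note, the 715 in the descriptor dump is simply the numeric value of the generic CUDA illegal-instruction error, matching the message the flash-attn check prints at the end. A quick way to confirm the mapping, assuming the cuda-python package is installed:

# Confirm that CUDA error code 715 is the "illegal instruction" error
# (requires cuda-python: pip install cuda-python)
from cuda import cudart

err = cudart.cudaError_t(715)
print(err)  # cudaError_t.cudaErrorIllegalInstruction
_, msg = cudart.cudaGetErrorString(err)
print(msg)  # b'an illegal instruction was encountered'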
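For context, here is roughly what my main.py does. This is a simplified sketch, not my exact code: the prompt and max_tokens are illustrative, and in my actual setup the single-layer restriction and the profiler start/stop calls are patched into the vLLM repo itself rather than living in the script.

# Simplified, illustrative sketch of main.py (not my exact code)
import time

import torch
from vllm import LLM, SamplingParams

# In my setup the repo is patched so the model runs with a single
# hidden layer (num_hidden_layers == 1); shown here only as the idea.
llm = LLM(model="facebook/opt-1.3b")
params = SamplingParams(max_tokens=16)

# Warm-up outside the profiled range; with --profile-from-start off,
# ncu only starts collecting once cudaProfilerStart() is reached.
llm.generate(["Hello, my name is"], params)

torch.cuda.profiler.start()  # cudaProfilerStart(): profiling range begins
t0 = time.time()
llm.generate(["Hello, my name is"], params)
print(f"Step 0 took {(time.time() - t0) * 1000} ms")
torch.cuda.profiler.stop()  # cudaProfilerStop(): profiling range ends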
To check whether a more recent CUDA version would still face the same issue, I briefly spun up two cloud VMs with the following configurations:
- H100 with CUDA 12.8.1, driver 570.124.06, ncu version 2025.1.0, Ubuntu 22.04: running the model works fine, but it again fails as soon as I add NCU profiling, with exactly the same error as above.
- H100 with CUDA 12.9, driver 575.51.03, ncu version 2025.2.0.0, Ubuntu 22.04: NCU profiling now works fine without any errors. I have no idea what changed, though; I didn’t spot anything noticeable in the changelog.
For cost reasons I cannot run my measurements in the cloud; I have to run them on an on-premise cluster where the H100 GPU is currently configured with CUDA 12.5, driver 555.42.06, and ncu 2024.2.1.0 (the configuration from the stack trace above). Unfortunately, upgrading to CUDA 12.9 is not an option in the short term, as it is a shared cluster and I am only a user of it.
I was hoping to understand what causes the error under NCU profiling when running without profiling works just fine. Any help is much appreciated!
Thanks a lot
