Hi,
I am using Nsight Compute 2025.1.0, which ships with CUDA 12.8 (driver 550.90.07), on an A100 80GB device. For a Llama inference workload, I see several metrics fail to collect. The command below collects a warp stall metric for a specific kernel:
$ export GPU_COUNT=1
$ export BATCH_SIZE=32
$ export NCU=/mnt/sw/cuda-12.8.0/nsight-compute-2025.1.0/ncu
$ export KNAME="ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn"
$ export LLAMA_CMD="python -u main.py --scenario Offline --model-path $CHECKPOINT_PATH --batch-size $BATCH_SIZE --dtype bfloat16 --user-conf user.conf --total-sample-count 1 --dataset-path $DATASET_PATH --output-log-dir output --tensor-parallel-size $GPU_COUNT --vllm"
$ $NCU \
--verbose \
--kernel-name $KNAME \
--metrics smsp__pcsamp_warps_issue_stalled_membar \
--export test --force-overwrite \
$LLAMA_CMD
The output is shown below:
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
INFO 12-08 08:57:30 gpu_executor.py:122] # GPU blocks: 28158, # CPU blocks: 2048
INFO 12-08 08:57:30 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 3.44x
INFO 12-08 08:57:32 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-08 08:57:32 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%
==ERROR== Failed to profile "ampere_bf16_s16816gemm_bf16_2..." in process 1498973
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==PROF== Report: /mnt/users/m/inference/language/llama3.1-8b/test.ncu-rep
I don't know whether this is related to the CUDA graph messages or not, but before that point the kernel had been profiled correctly (six launches completed).
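To test that suspicion, one experiment I am considering, based on the hint in the vLLM log itself, is rerunning with CUDA graphs disabled. This is only a sketch: it assumes the harness forwards `--enforce-eager` to vLLM (otherwise the equivalent is setting `enforce_eager=True` in the engine arguments); everything else is copied from my command above.

```shell
# Same profile, but with vLLM in eager mode so no CUDA graphs are captured.
# If the kernel then profiles at every launch, the failure is likely tied
# to graph capture/replay rather than to the metric itself.
export LLAMA_CMD="python -u main.py --scenario Offline --model-path $CHECKPOINT_PATH --batch-size $BATCH_SIZE --dtype bfloat16 --user-conf user.conf --total-sample-count 1 --dataset-path $DATASET_PATH --output-log-dir output --tensor-parallel-size $GPU_COUNT --vllm --enforce-eager"
$NCU \
--verbose \
--kernel-name $KNAME \
--metrics smsp__pcsamp_warps_issue_stalled_membar \
--export test_eager --force-overwrite \
$LLAMA_CMD
```

If the failure really is graph-related, I understand ncu also has a `--graph-profiling graph` option to profile a CUDA graph as a single workload instead of node by node, though I have not confirmed it helps in this case.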
I know there are newer versions of Nsight Compute, but the latest one is incompatible with our driver version. Upgrading the driver means asking the sysadmin, and the request may not be accepted, so before going down that path I would like to know whether this problem is fixed in a newer version.
As far as I know, there is no other way to see why profiling this kernel fails at that specific step. As the output shows, the failure does not happen at the first occurrence of the kernel.
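The closest I can get to isolating the problem is restricting collection to the failing launch. A sketch using ncu's `--launch-skip`/`--launch-count` filters (the report name `test_launch7` is just my choice):

```shell
# Skip the six launches of this kernel that profile successfully and
# collect only the seventh, failing one, so any error output refers to
# exactly that launch.
$NCU \
--verbose \
--kernel-name $KNAME \
--launch-skip 6 --launch-count 1 \
--metrics smsp__pcsamp_warps_issue_stalled_membar \
--export test_launch7 --force-overwrite \
$LLAMA_CMD
```

That narrows down which launch fails, but it still does not tell me why it fails.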
Any idea what might cause this? I see similar failures with other metrics.