Hi,
I am using Nsight Compute 2025.1.0, which ships with CUDA 12.8 (driver 550.90.07), on an A100 80GB device. For a Llama inference workload, I see several metrics fail to collect. The command below collects a warp stall metric for a specific kernel:
$ export GPU_COUNT=1
$ export BATCH_SIZE=32
$ export NCU=/mnt/sw/cuda-12.8.0/nsight-compute-2025.1.0/ncu
$ export KNAME="ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn"
$ export LLAMA_CMD="python -u main.py --scenario Offline --model-path $CHECKPOINT_PATH --batch-size $BATCH_SIZE --dtype bfloat16 --user-conf user.conf --total-sample-count 1 --dataset-path $DATASET_PATH --output-log-dir output --tensor-parallel-size $GPU_COUNT --vllm"
$ $NCU \
--verbose \
--kernel-name $KNAME \
--metrics smsp__pcsamp_warps_issue_stalled_membar \
--export test --force-overwrite \
$LLAMA_CMD
The output is shown below:
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%....50%....100% - 2 passes
INFO 12-08 08:57:30 gpu_executor.py:122] # GPU blocks: 28158, # CPU blocks: 2048
INFO 12-08 08:57:30 gpu_executor.py:126] Maximum concurrency for 131072 tokens per request: 3.44x
INFO 12-08 08:57:32 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-08 08:57:32 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
==PROF== Profiling "ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_tn": 0%
==ERROR== Failed to profile "ampere_bf16_s16816gemm_bf16_2..." in process 1498973
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==PROF== Report: /mnt/users/m/inference/language/llama3.1-8b/test.ncu-rep
I don't know whether this is related to the CUDA graph messages or not, but before that point the kernel had been profiled correctly (six launches completed).
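To test that suspicion, one experiment I am considering, based on the hint in the vLLM log itself, is rerunning with CUDA graphs disabled. This is only a sketch: it assumes the harness forwards `--enforce-eager` to vLLM (otherwise the equivalent is setting `enforce_eager=True` in the engine arguments); everything else is copied from my command above.

```shell
# Same profile, but with vLLM in eager mode so no CUDA graphs are captured.
# If the kernel then profiles at every launch, the failure is likely tied
# to graph capture/replay rather than to the metric itself.
export LLAMA_CMD="python -u main.py --scenario Offline --model-path $CHECKPOINT_PATH --batch-size $BATCH_SIZE --dtype bfloat16 --user-conf user.conf --total-sample-count 1 --dataset-path $DATASET_PATH --output-log-dir output --tensor-parallel-size $GPU_COUNT --vllm --enforce-eager"
$NCU \
--verbose \
--kernel-name $KNAME \
--metrics smsp__pcsamp_warps_issue_stalled_membar \
--export test_eager --force-overwrite \
$LLAMA_CMD
```

If the failure really is graph-related, I understand ncu also has a `--graph-profiling graph` option to profile a CUDA graph as a single workload instead of node by node, though I have not confirmed it helps in this case.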
I know there are newer versions of Nsight Compute, but the latest one is incompatible with our driver version. Upgrading the driver means asking the sysadmin, and the request may not be accepted, so before going down that path I would like to know whether this problem is fixed in a newer version.
As far as I know, there is no other way to see why profiling this kernel fails at that specific step. As the output shows, the failure does not happen at the first occurrence of the kernel.
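The closest I can get to isolating the problem is restricting collection to the failing launch. A sketch using ncu's `--launch-skip`/`--launch-count` filters (the report name `test_launch7` is just my choice):

```shell
# Skip the six launches of this kernel that profile successfully and
# collect only the seventh, failing one, so any error output refers to
# exactly that launch.
$NCU \
--verbose \
--kernel-name $KNAME \
--launch-skip 6 --launch-count 1 \
--metrics smsp__pcsamp_warps_issue_stalled_membar \
--export test_launch7 --force-overwrite \
$LLAMA_CMD
```

That narrows down which launch fails, but it still does not tell me why it fails.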
Any idea what might cause this? I see similar failures with other metrics.