CUBLAS_STATUS_EXECUTION_FAILED

I’m using Nsight Compute to profile a vLLM process and trying to extract some low-level metrics with the following command:
ncu --metrics smsp__sass_thread_inst_executed_op_fp8_pred_on.sum,smsp__sass_thread_inst_executed_op_hadd_pred_on.sum,smsp__sass_thread_inst_executed_op_hmul_pred_on.sum,smsp__sass_thread_inst_executed_op_hfma_pred_on.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,gpu__time_duration.sum,smsp__cycles_elapsed.sum,smsp__cycles_active.sum,smsp__sass_thread_inst_executed_op_fp8_pred_on.sum,smsp__sass_thread_inst_executed_op_fp8_fma_pred_on.sum --call-stack --launch-skip 0 --launch-count 4000 --replay-mode kernel -f --export smsp_timing python vllm_test.py

I keep getting the following error:
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250204-020536.pkl): CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

Any ideas how to solve this?

specs:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:04:00.0 Off |                    0 |
| N/A   28C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

My initial recommendation is to remove all smsp__sass* metrics from the command line. If the error goes away, the issue is likely in the SASS patching logic. You can then add the smsp__sass* metrics back in, one at a time or in small groups, to determine which specific metric or metrics cause the failure.
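For example, a first pass with only the non-SASS metrics (keeping the rest of your original options unchanged) could look like the command below; this is just a sketch derived from your command line, not a required metric set:

ncu --metrics gpu__time_duration.sum,smsp__cycles_elapsed.sum,smsp__cycles_active.sum --call-stack --launch-skip 0 --launch-count 4000 --replay-mode kernel -f --export smsp_timing python vllm_test.py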

The non-SASS metrics do not increase or otherwise impact grid execution. The smsp__sass* metrics are collected by binary-patching the shader (SASS) instructions, which transparently changes the control flow and execution mix. The resulting increase in instructions executed can cause a 10-100x increase in duration. The smsp__sass* metrics are also collected in a different replay pass than the hardware performance counters and SM program counter sampling.
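If the reduced run is clean, one way to isolate the offending metric is to profile with a single smsp__sass* metric per run, as in the sketch below. The export names and the reduced --launch-count are just examples (the patched replays are slow, so you may not want 4000 launches while bisecting); the failing metric is whichever run reproduces the CUBLAS_STATUS_EXECUTION_FAILED error.

for m in \
    smsp__sass_thread_inst_executed_op_fp8_pred_on.sum \
    smsp__sass_thread_inst_executed_op_hadd_pred_on.sum \
    smsp__sass_thread_inst_executed_op_hmul_pred_on.sum \
    smsp__sass_thread_inst_executed_op_hfma_pred_on.sum \
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum \
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum \
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum \
    smsp__sass_thread_inst_executed_op_fp8_fma_pred_on.sum
do
    # One kernel-replay profiling pass per SASS metric
    echo "=== profiling with $m ==="
    ncu --metrics "$m" --launch-skip 0 --launch-count 100 --replay-mode kernel \
        -f --export "sass_$m" python vllm_test.py
done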