CUBLAS_STATUS_EXECUTION_FAILED

I’m using Nsight Compute to profile a vLLM process and trying to extract some low-level metrics with the following command:
ncu --metrics smsp__sass_thread_inst_executed_op_fp8_pred_on.sum,smsp__sass_thread_inst_executed_op_hadd_pred_on.sum,smsp__sass_thread_inst_executed_op_hmul_pred_on.sum,smsp__sass_thread_inst_executed_op_hfma_pred_on.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,gpu__time_duration.sum,smsp__cycles_elapsed.sum,smsp__cycles_active.sum,smsp__sass_thread_inst_executed_op_fp8_pred_on.sum,smsp__sass_thread_inst_executed_op_fp8_fma_pred_on.sum --call-stack --launch-skip 0 --launch-count 4000 --replay-mode kernel -f --export smsp_timing python vllm_test.py

I keep getting the following error:
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250204-020536.pkl): CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

Any ideas how to solve this?

specs:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:04:00.0 Off |                    0 |
| N/A   28C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

My initial recommendation is to remove all smsp__sass* metrics from the command line. If the error goes away, the issue is likely in the SASS patching logic. You can then add the smsp__sass* metrics back in, one at a time or in small groups, to determine which specific metric or metrics cause the failure.
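For example, a first pass with only the non-SASS metrics (keeping the rest of your original options unchanged) could look like the command below; this is just a sketch derived from your command line, not a required metric set:

ncu --metrics gpu__time_duration.sum,smsp__cycles_elapsed.sum,smsp__cycles_active.sum --call-stack --launch-skip 0 --launch-count 4000 --replay-mode kernel -f --export smsp_timing python vllm_test.py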

The non-SASS metrics do not increase or otherwise impact grid execution. The smsp__sass* metrics are collected by binary-patching the shader (SASS) instructions, which transparently changes the control flow and execution mix. The resulting increase in instructions executed can cause a 10-100x increase in duration. The smsp__sass* metrics are also collected in a different replay pass than the hardware performance counters and SM program counter sampling.
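If the reduced run is clean, one way to isolate the offending metric is to profile with a single smsp__sass* metric per run, as in the sketch below. The export names and the reduced --launch-count are just examples (the patched replays are slow, so you may not want 4000 launches while bisecting); the failing metric is whichever run reproduces the CUBLAS_STATUS_EXECUTION_FAILED error.

for m in \
    smsp__sass_thread_inst_executed_op_fp8_pred_on.sum \
    smsp__sass_thread_inst_executed_op_hadd_pred_on.sum \
    smsp__sass_thread_inst_executed_op_hmul_pred_on.sum \
    smsp__sass_thread_inst_executed_op_hfma_pred_on.sum \
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum \
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum \
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum \
    smsp__sass_thread_inst_executed_op_fp8_fma_pred_on.sum
do
    # One kernel-replay profiling pass per SASS metric
    echo "=== profiling with $m ==="
    ncu --metrics "$m" --launch-skip 0 --launch-count 100 --replay-mode kernel \
        -f --export "sass_$m" python vllm_test.py
done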