I'm using Nsight Compute to profile a vLLM process and trying to extract some low-level metrics with the following command:
ncu --metrics smsp__sass_thread_inst_executed_op_fp8_pred_on.sum,smsp__sass_thread_inst_executed_op_hadd_pred_on.sum,smsp__sass_thread_inst_executed_op_hmul_pred_on.sum,smsp__sass_thread_inst_executed_op_hfma_pred_on.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,gpu__time_duration.sum,smsp__cycles_elapsed.sum,smsp__cycles_active.sum,smsp__sass_thread_inst_executed_op_fp8_pred_on.sum,smsp__sass_thread_inst_executed_op_fp8_fma_pred_on.sum --call-stack --launch-skip 0 --launch-count 4000 --replay-mode kernel -f --export smsp_timing python vllm_test.py
I keep getting the following error:
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250204-020536.pkl): CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
Any ideas on how to solve this?
Specs:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:04:00.0 Off |                    0 |
| N/A   28C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+