Using cuDNN Backend to Create a Fused Attention fprop Graph

I am working on implementing the Fused Attention fprop graph. As of now I am only combining the two matrix multiplications, meaning g3 and g4 are empty. I believe I have matched all the requirements for this graph, but none of the engine configurations provided by the engine heuristic work when passed to an execution plan: finalizing the execution plan with any of the engine configurations returns the status CUDNN_STATUS_NOT_SUPPORTED.

I have pasted the implementation I am using, as well as the shapes and strides of all tensors involved. Please let me know what I am doing incorrectly, or any other information that would help.

qshape: 4: 1 1 10 64 
qstride: 4: 640 640 64 1 
kshape: 4: 1 1 64 10 
kstride: 4: 640 640 1 64 
sshape: 4: 1 1 10 10 
sstride: 4: 100 100 10 1 
vshape: 4: 1 1 10 64 
vstride: 4: 640 640 64 1 
oshape: 4: 1 1 10 64 
ostride: 4: 640 640 64 1
#include <cudnn.h>

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

#define CUDNN_CHECK(expr)                                                      \
    {                                                                          \
        cudnnStatus_t status_ = (expr); /* evaluate the call exactly once */   \
        if (status_ != CUDNN_STATUS_SUCCESS) {                                 \
            fprintf(stderr, "cuDNN error: %s:%d:%s\n", __FILE__, __LINE__,     \
                    cudnnGetErrorString(status_));                             \
            std::exit(EXIT_FAILURE);                                           \
        }                                                                      \
    }

void print_vector(const std::vector<int64_t> &v, const std::string &name) {
    std::cout << name << ": " << v.size() << ": ";
    for (int64_t i : v) {
        std::cout << i << " ";
    }
    std::cout << std::endl;
}

// Fully packed, row-major strides for a 4-D shape (innermost dimension last).
std::vector<int64_t> standard_4d_strides(const std::vector<int64_t> &shape) {
    return {shape[1] * shape[2] * shape[3], shape[2] * shape[3], shape[3], 1};
}

cudnnBackendDescriptor_t
tensor_descriptor(const std::vector<int64_t> &shape,
                  const std::vector<int64_t> &strides, int64_t id,
                  cudnnDataType_t data_type, int64_t byte_alignment,
                  bool is_virtual, bool reordering_fp16x16 = false) {
    cudnnBackendDescriptor_t desc;
    CUDNN_CHECK(
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_UNIQUE_ID,
                                         CUDNN_TYPE_INT64, 1, &id));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_DATA_TYPE,
                                         CUDNN_TYPE_DATA_TYPE, 1, &data_type));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
                                         CUDNN_TYPE_INT64, 1, &byte_alignment));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_DIMENSIONS,
                                         CUDNN_TYPE_INT64,
                                         (int64_t)shape.size(), shape.data()));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        desc, CUDNN_ATTR_TENSOR_STRIDES, CUDNN_TYPE_INT64,
        (int64_t)strides.size(), strides.data()));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_IS_VIRTUAL,
                                         CUDNN_TYPE_BOOLEAN, 1, &is_virtual));
    if (reordering_fp16x16) {
        cudnnBackendTensorReordering_t reorder = CUDNN_TENSOR_REORDERING_F16x16;
        CUDNN_CHECK(cudnnBackendSetAttribute(
            desc, CUDNN_ATTR_TENSOR_REORDERING_MODE,
            CUDNN_TYPE_TENSOR_REORDERING_MODE, 1, &reorder));
    }
    CUDNN_CHECK(cudnnBackendFinalize(desc));
    return desc;
}

int main() {
    std::vector<int64_t> shape_query = {1, 1, 10, 64};
    std::vector<int64_t> strides_query = standard_4d_strides(shape_query);
    std::vector<int64_t> shape_key = {1, 1, 10, 64};
    std::vector<int64_t> strides_key = standard_4d_strides(shape_key);
    // Swap the last two dims and strides so K is described as its transpose:
    // a 64x10 view over the same row-major 10x64 memory.
    std::swap(shape_key[2], shape_key[3]);
    std::swap(strides_key[2], strides_key[3]);
    std::vector<int64_t> shape_value = {1, 1, 10, 64};
    std::vector<int64_t> strides_value = standard_4d_strides(shape_value);

    std::vector<int64_t> shape_scores = {shape_query[0], shape_query[1],
                                         shape_query[2], shape_key[3]};
    std::vector<int64_t> strides_scores = standard_4d_strides(shape_scores);

    std::vector<int64_t> shape_output = {shape_query[0], shape_query[1],
                                         shape_query[2], shape_value[3]};
    std::vector<int64_t> strides_output = standard_4d_strides(shape_output);
    cudnnHandle_t handle;
    CUDNN_CHECK(cudnnCreate(&handle));
    cudnnDataType_t comp_type = CUDNN_DATA_FLOAT;
    cudnnDataType_t data_type = CUDNN_DATA_HALF;
    int64_t data_type_byte_alignment = 2;

    cudnnBackendDescriptor_t query_desc =
        tensor_descriptor(shape_query, strides_query, 'q', data_type,
                          data_type_byte_alignment, false);
    cudnnBackendDescriptor_t key_desc =
        tensor_descriptor(shape_key, strides_key, 'k', data_type,
                          data_type_byte_alignment, false);
    cudnnBackendDescriptor_t value_desc =
        tensor_descriptor(shape_value, strides_value, 'v', data_type,
                          data_type_byte_alignment, false);
    // S = Q*K^T is an intermediate of the fused graph: virtual, with F16x16
    // reordering, which the fused attention pattern expects on this tensor.
    cudnnBackendDescriptor_t scores_desc =
        tensor_descriptor(shape_scores, strides_scores, 's', data_type,
                          data_type_byte_alignment, true, true);
    cudnnBackendDescriptor_t output_desc =
        tensor_descriptor(shape_output, strides_output, 'o', data_type,
                          data_type_byte_alignment, false);

    print_vector(shape_query, "qshape");
    print_vector(strides_query, "qstride");
    print_vector(shape_key, "kshape");
    print_vector(strides_key, "kstride");
    print_vector(shape_scores, "sshape");
    print_vector(strides_scores, "sstride");
    print_vector(shape_value, "vshape");
    print_vector(strides_value, "vstride");
    print_vector(shape_output, "oshape");
    print_vector(strides_output, "ostride");

    cudnnBackendDescriptor_t matmul_desc;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_MATMUL_DESCRIPTOR,
                                             &matmul_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(matmul_desc,
                                         CUDNN_ATTR_MATMUL_COMP_TYPE,
                                         CUDNN_TYPE_DATA_TYPE, 1, &comp_type));
    CUDNN_CHECK(cudnnBackendFinalize(matmul_desc));
    cudnnBackendDescriptor_t op_matmul;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR, &op_matmul));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul, CUDNN_ATTR_OPERATION_MATMUL_DESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &matmul_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul, CUDNN_ATTR_OPERATION_MATMUL_ADESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &query_desc));
    CUDNN_CHECK(
        cudnnBackendSetAttribute(op_matmul, CUDNN_ATTR_OPERATION_MATMUL_BDESC,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &key_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul, CUDNN_ATTR_OPERATION_MATMUL_CDESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &scores_desc));
    CUDNN_CHECK(cudnnBackendFinalize(op_matmul));

    cudnnBackendDescriptor_t out_matmul_desc;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_MATMUL_DESCRIPTOR,
                                             &out_matmul_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(out_matmul_desc,
                                         CUDNN_ATTR_MATMUL_COMP_TYPE,
                                         CUDNN_TYPE_DATA_TYPE, 1, &comp_type));
    CUDNN_CHECK(cudnnBackendFinalize(out_matmul_desc));
    cudnnBackendDescriptor_t op_matmul_output;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR, &op_matmul_output));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul_output, CUDNN_ATTR_OPERATION_MATMUL_DESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &out_matmul_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul_output, CUDNN_ATTR_OPERATION_MATMUL_ADESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &scores_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul_output, CUDNN_ATTR_OPERATION_MATMUL_BDESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &value_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul_output, CUDNN_ATTR_OPERATION_MATMUL_CDESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &output_desc));
    CUDNN_CHECK(cudnnBackendFinalize(op_matmul_output));

    cudnnBackendDescriptor_t op_graph;
    cudnnBackendDescriptor_t ops[] = {op_matmul, op_matmul_output};
    CUDNN_CHECK(cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR, &op_graph));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_graph, CUDNN_ATTR_OPERATIONGRAPH_OPS, CUDNN_TYPE_BACKEND_DESCRIPTOR,
        sizeof(ops) / sizeof(ops[0]), ops));
    CUDNN_CHECK(cudnnBackendSetAttribute(op_graph,
                                         CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
                                         CUDNN_TYPE_HANDLE, 1, &handle));
    CUDNN_CHECK(cudnnBackendFinalize(op_graph));

    cudnnBackendDescriptor_t heur_desc;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        heur_desc, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &op_graph));
    cudnnBackendHeurMode_t heur_mode = CUDNN_HEUR_MODE_FALLBACK;
    CUDNN_CHECK(cudnnBackendSetAttribute(heur_desc, CUDNN_ATTR_ENGINEHEUR_MODE,
                                         CUDNN_TYPE_HEUR_MODE, 1, &heur_mode));
    CUDNN_CHECK(cudnnBackendFinalize(heur_desc));
    int64_t count = 0;
    // First query with requestedElementCount = 0: only retrieves how many
    // engine configs the heuristic produced.
    CUDNN_CHECK(cudnnBackendGetAttribute(
        heur_desc, CUDNN_ATTR_ENGINEHEUR_RESULTS, CUDNN_TYPE_BACKEND_DESCRIPTOR,
        0, &count, NULL));
    std::vector<cudnnBackendDescriptor_t> eng_cfgs(count);
    for (cudnnBackendDescriptor_t &cfg : eng_cfgs) {
        CUDNN_CHECK(cudnnBackendCreateDescriptor(
            CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfg));
    }
    // Second query: fill the pre-created ENGINECFG descriptors with the
    // actual configs returned by the heuristic.
    CUDNN_CHECK(cudnnBackendGetAttribute(
        heur_desc, CUDNN_ATTR_ENGINEHEUR_RESULTS, CUDNN_TYPE_BACKEND_DESCRIPTOR,
        count, nullptr, eng_cfgs.data()));

    for (cudnnBackendDescriptor_t &cfg : eng_cfgs) {
        cudnnBackendDescriptor_t exec_plan;
        CUDNN_CHECK(cudnnBackendCreateDescriptor(
            CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &exec_plan));
        CUDNN_CHECK(cudnnBackendSetAttribute(
            exec_plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
            CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfg));
        CUDNN_CHECK(cudnnBackendSetAttribute(exec_plan,
                                             CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                             CUDNN_TYPE_HANDLE, 1, &handle));
        cudnnStatus_t status = cudnnBackendFinalize(exec_plan);
        std::cout << cudnnGetErrorString(status) << "\n";
        if (status == CUDNN_STATUS_SUCCESS) {
            std::cout << "success\n";
        }
    }

    // To be filled in

    return 0;
}

Hello,

What kind of device do you have?
Maybe you should check the log or read the last cuDNN error.

There will probably be a warning like this:

i!     engineConfig: type=json; val={"engineId":7,"smVersion":890,"knobChoices":{"CUDNN_KNOB_TYPE_SPLIT_K_SLC":-1,"CUDNN_KNOB_TYPE_KERNEL_CFG":25}};
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: Multi-GEMM is only supported on Hopper at: receive_op_->target != fort_device::SM90
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: check_matmul_support_fort(node)
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: check_node_support_fort(node_ptr)
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: check_for_support()
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: (CUDNN_STATUS_NOT_INITIALIZED == status) || (CUDNN_STATUS_NOT_SUPPORTED == ((status) / 1000 * 1000))
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: ptr->isSupported()
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: finalize_internal()

Note:

Reason: Multi-GEMM is only supported on Hopper at: receive_op_->target != fort_device::SM90

Hope this helps.

Thank you for your reply.

I am using cuDNN 9.1.1. I’ve tried running the program with logging enabled to capture error messages, but it’s not providing much additional information.

When logging only errors and warnings, nothing is reported. Logging everything doesn’t show anything irregular about the function calls. With CUDNN_HEUR_MODE_FALLBACK as the heuristic mode, none of the execution plans built from the returned engine configurations finalize successfully. With CUDNN_HEUR_MODE_A, zero engine configurations are returned. I tried mode A because it is reported to support patterns compatible with the runtime fusion engine.

When CUDNN_STATUS_NOT_SUPPORTED is returned during finalization of the execution plans, cudnnGetLastErrorString returns an empty string.

Is there another way to gain insight into why the execution plans are failing to finalize?

The execution plan fails to finalize because finalizing the engine config fails.

Is there another way to gain insight into why the execution plans are failing to finalize?

Yes, with logging.

Here is a link to the documentation showing how to use logging: Troubleshooting — NVIDIA cuDNN

Set the log level to 3 and send the log to a file, e.g. “CUDNN_LOGDEST_DBG=filename.txt” (if you use system variables, a system restart may be required). The log will show in more detail why the engine config could not be created (see my last post).
When the heuristic fails to finalize an engine config, there is no error, only an informational message, which is why log level 3 is required.
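
For example, a minimal way to set this up from inside the program rather than the shell (a sketch assuming a POSIX system; the variables must be set before the first cuDNN call, since cuDNN reads them when it initializes):

#include <cstdlib>  // setenv

int main() {
    // Log level 3 = errors, warnings, and informational messages.
    setenv("CUDNN_LOGLEVEL_DBG", "3", /*overwrite=*/1);
    // Redirect the log from stderr to a file.
    setenv("CUDNN_LOGDEST_DBG", "filename.txt", /*overwrite=*/1);
    // ... create the cuDNN handle and build the graph after this point ...
    return 0;
}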

However, given your test setup (two matmuls in a single graph), I’d guess you are running into this problem because your device does not support Multi-GEMM (multiple matmul operations in a single graph). If you do not have a Hopper device (i.e. H100 or H200), this is not supported (CUDNN_STATUS_NOT_SUPPORTED).

Try creating a graph with multiple pointwise operations; that should work. A minimal sketch follows.
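
Here is a sketch of one such pointwise node (ReLU chosen arbitrarily; x_desc and y_desc are assumed to be tensor descriptors built like the others in your post), which you would then put into an operation graph the same way as your matmul ops:

cudnnBackendDescriptor_t pw_desc;
CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_POINTWISE_DESCRIPTOR,
                                         &pw_desc));
cudnnPointwiseMode_t pw_mode = CUDNN_POINTWISE_RELU_FWD;
CUDNN_CHECK(cudnnBackendSetAttribute(pw_desc, CUDNN_ATTR_POINTWISE_MODE,
                                     CUDNN_TYPE_POINTWISE_MODE, 1, &pw_mode));
CUDNN_CHECK(cudnnBackendFinalize(pw_desc));

cudnnBackendDescriptor_t op_pw;
CUDNN_CHECK(cudnnBackendCreateDescriptor(
    CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR, &op_pw));
CUDNN_CHECK(cudnnBackendSetAttribute(
    op_pw, CUDNN_ATTR_OPERATION_POINTWISE_PW_DESCRIPTOR,
    CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &pw_desc));
CUDNN_CHECK(cudnnBackendSetAttribute(op_pw, CUDNN_ATTR_OPERATION_POINTWISE_XDESC,
                                     CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &x_desc));
CUDNN_CHECK(cudnnBackendSetAttribute(op_pw, CUDNN_ATTR_OPERATION_POINTWISE_YDESC,
                                     CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &y_desc));
CUDNN_CHECK(cudnnBackendFinalize(op_pw));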

I followed your instructions using log level 3, but I don’t see any information about why the heuristic is failing to finalize the engine configuration. I have pasted what I believe are the relevant informational logs. These calls cover the creation, setup, finalization, and count retrieval for the heuristic using mode A:

I! CuDNN (v90101 17) function cudnnBackendCreateDescriptor() called:
i!     descriptorType: type=cudnnBackendDescriptorType_t; val=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR (4);
i! Time: 2025-01-03T13:17:30.734930 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90101 17) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH (201);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR; val=0x1e5c2f0;
i!         : type=bool; val=true;
i! Time: 2025-01-03T13:17:30.734951 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90101 17) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_MODE (200);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_HEUR_MODE (8);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=int; val=3;
i! Time: 2025-01-03T13:17:30.734974 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90101 17) function cudnnBackendFinalize() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i! Time: 2025-01-03T13:17:30.734980 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90101 17) function cudnnBackendGetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_RESULTS (202);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     requestedElementCount: type=int64_t; val=0;
i!     elementCount: location=host; addr=0x7ffd79a1bd98;
i!     arrayOfElements: location=host; addr=NULL_PTR;
i! Time: 2025-01-03T13:17:30.735015 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.

It seems like my error messages regarding engine configurations all appear when finalizing the execution plan. I am using an L40S.

When finalizing execution plans with the five engine configurations returned by CUDNN_HEUR_MODE_FALLBACK, warnings are logged with these reasons for the failure to finalize the execution plan:

  • CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH
  • CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH
  • CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: is_multiple_gemms_acc_fusion
  • CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: Fused Attention patterns are not supported by this engine at: is_fmha
  • CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH

I also get one engine configuration back when using CUDNN_HEUR_MODE_A, which fails with:

  • CUDNN_STATUS_NOT_SUPPORTED_SHARED_MEMORY_INSUFFICIENT

However, cudaMemGetInfo reports that I have 47 GB free. From my understanding of the heuristic modes, if it is possible to execute the graph at all, FALLBACK should return an engine configuration that works, meaning the Fused Attention fprop graph pattern isn’t supported on my device, as you stated. I just wanted to be able to reproduce these reasons with my program.

In order to support MHA fprop in the general case, would I have to create two operation graphs with separate engine configs and execution plans: one for scaling->matmul and another for softmax->matmul?

OK, according to this site, NVIDIA L40S GPU, the L40S is the Ada Lovelace architecture.

Since you are using cuDNN 9.1.1, that could explain why you get different error messages. Newer versions of cuDNN have improved error handling and logging. For example, version 9.3: "Error messages generated during retrieval of the CUDNN_ATTR_ENGINEHEUR_RESULTS attribute are accessible through the cudnnGetLastErrorString() function." from Release Notes — NVIDIA cuDNN
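
On a cuDNN version with that improvement, reading the message back after a failed finalize could look roughly like this (a sketch reusing the exec_plan from your loop; buffer size chosen arbitrarily):

cudnnStatus_t status = cudnnBackendFinalize(exec_plan);
if (status != CUDNN_STATUS_SUCCESS) {
    char msg[1024] = {0};
    cudnnGetLastErrorString(msg, sizeof(msg));  // last error/warning text
    std::cerr << "finalize failed: " << cudnnGetErrorString(status)
              << " -- " << msg << "\n";
}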

I am currently at that point myself, and my best guess is: yes. Maybe you want to join this discussion: Conceptual questions about how to use cuDNN v9

Yes, I will join that discussion. Thank you so much; you have been very helpful.