Using cuDNN Backend to Create a Fused Attention fprop Graph

I am working on implementing the Fused Attention fprop graph. As of now I am only combining the two matrix multiplications, meaning g3 and g4 are empty. I believe I have matched all the requirements for this graph, but none of the engine configurations provided by the engine heuristic work when passed to an execution plan: finalizing the execution plan with any of the engine configurations returns the status CUDNN_STATUS_NOT_SUPPORTED.

I have pasted the implementation I am using, as well as the shapes and strides of all tensors involved. Please let me know what I am doing incorrectly, or any other information that would help.

qshape: 4: 1 1 10 64 
qstride: 4: 640 640 64 1 
kshape: 4: 1 1 64 10 
kstride: 4: 640 640 1 64 
sshape: 4: 1 1 10 10 
sstride: 4: 100 100 10 1 
vshape: 4: 1 1 10 64 
vstride: 4: 640 640 64 1 
oshape: 4: 1 1 10 64 
ostride: 4: 640 640 64 1
#include <cudnn.h>

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

#define CUDNN_CHECK(expr)                                                      \
    {                                                                          \
        cudnnStatus_t status_ = (expr); /* evaluate the call exactly once */   \
        if (status_ != CUDNN_STATUS_SUCCESS) {                                 \
            fprintf(stderr, "cuDNN error: %s:%d:%s\n", __FILE__, __LINE__,     \
                    cudnnGetErrorString(status_));                             \
            std::exit(EXIT_FAILURE);                                           \
        }                                                                      \
    }

void print_vector(const std::vector<int64_t> &v, const std::string &name) {
    std::cout << name << ": " << v.size() << ": ";
    for (int64_t i : v) {
        std::cout << i << " ";
    }
    std::cout << std::endl;
}

// Fully packed, row-major strides for a 4-D shape (innermost dimension last).
std::vector<int64_t> standard_4d_strides(const std::vector<int64_t> &shape) {
    return {shape[1] * shape[2] * shape[3], shape[2] * shape[3], shape[3], 1};
}

cudnnBackendDescriptor_t
tensor_descriptor(const std::vector<int64_t> &shape,
                  const std::vector<int64_t> &strides, int64_t id,
                  cudnnDataType_t data_type, int64_t byte_alignment,
                  bool is_virtual, bool reordering_fp16x16 = false) {
    cudnnBackendDescriptor_t desc;
    CUDNN_CHECK(
        cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_UNIQUE_ID,
                                         CUDNN_TYPE_INT64, 1, &id));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_DATA_TYPE,
                                         CUDNN_TYPE_DATA_TYPE, 1, &data_type));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
                                         CUDNN_TYPE_INT64, 1, &byte_alignment));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_DIMENSIONS,
                                         CUDNN_TYPE_INT64,
                                         (int64_t)shape.size(), shape.data()));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        desc, CUDNN_ATTR_TENSOR_STRIDES, CUDNN_TYPE_INT64,
        (int64_t)strides.size(), strides.data()));
    CUDNN_CHECK(cudnnBackendSetAttribute(desc, CUDNN_ATTR_TENSOR_IS_VIRTUAL,
                                         CUDNN_TYPE_BOOLEAN, 1, &is_virtual));
    if (reordering_fp16x16) {
        cudnnBackendTensorReordering_t reorder = CUDNN_TENSOR_REORDERING_F16x16;
        CUDNN_CHECK(cudnnBackendSetAttribute(
            desc, CUDNN_ATTR_TENSOR_REORDERING_MODE,
            CUDNN_TYPE_TENSOR_REORDERING_MODE, 1, &reorder));
    }
    CUDNN_CHECK(cudnnBackendFinalize(desc));
    return desc;
}

int main() {
    std::vector<int64_t> shape_query = {1, 1, 10, 64};
    std::vector<int64_t> strides_query = standard_4d_strides(shape_query);
    std::vector<int64_t> shape_key = {1, 1, 10, 64};
    std::vector<int64_t> strides_key = standard_4d_strides(shape_key);
    // Swap the last two dims and strides so K is described as its transpose:
    // a 64x10 view over the same row-major 10x64 memory.
    std::swap(shape_key[2], shape_key[3]);
    std::swap(strides_key[2], strides_key[3]);
    std::vector<int64_t> shape_value = {1, 1, 10, 64};
    std::vector<int64_t> strides_value = standard_4d_strides(shape_value);

    std::vector<int64_t> shape_scores = {shape_query[0], shape_query[1],
                                         shape_query[2], shape_key[3]};
    std::vector<int64_t> strides_scores = standard_4d_strides(shape_scores);

    std::vector<int64_t> shape_output = {shape_query[0], shape_query[1],
                                         shape_query[2], shape_value[3]};
    std::vector<int64_t> strides_output = standard_4d_strides(shape_output);
    cudnnHandle_t handle;
    CUDNN_CHECK(cudnnCreate(&handle));
    cudnnDataType_t comp_type = CUDNN_DATA_FLOAT;
    cudnnDataType_t data_type = CUDNN_DATA_HALF;
    int64_t data_type_byte_alignment = 2;

    cudnnBackendDescriptor_t query_desc =
        tensor_descriptor(shape_query, strides_query, 'q', data_type,
                          data_type_byte_alignment, false);
    cudnnBackendDescriptor_t key_desc =
        tensor_descriptor(shape_key, strides_key, 'k', data_type,
                          data_type_byte_alignment, false);
    cudnnBackendDescriptor_t value_desc =
        tensor_descriptor(shape_value, strides_value, 'v', data_type,
                          data_type_byte_alignment, false);
    // S = Q*K^T is an intermediate of the fused graph: virtual, with F16x16
    // reordering, which the fused attention pattern expects on this tensor.
    cudnnBackendDescriptor_t scores_desc =
        tensor_descriptor(shape_scores, strides_scores, 's', data_type,
                          data_type_byte_alignment, true, true);
    cudnnBackendDescriptor_t output_desc =
        tensor_descriptor(shape_output, strides_output, 'o', data_type,
                          data_type_byte_alignment, false);

    print_vector(shape_query, "qshape");
    print_vector(strides_query, "qstride");
    print_vector(shape_key, "kshape");
    print_vector(strides_key, "kstride");
    print_vector(shape_scores, "sshape");
    print_vector(strides_scores, "sstride");
    print_vector(shape_value, "vshape");
    print_vector(strides_value, "vstride");
    print_vector(shape_output, "oshape");
    print_vector(strides_output, "ostride");

    cudnnBackendDescriptor_t matmul_desc;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_MATMUL_DESCRIPTOR,
                                             &matmul_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(matmul_desc,
                                         CUDNN_ATTR_MATMUL_COMP_TYPE,
                                         CUDNN_TYPE_DATA_TYPE, 1, &comp_type));
    CUDNN_CHECK(cudnnBackendFinalize(matmul_desc));
    cudnnBackendDescriptor_t op_matmul;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR, &op_matmul));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul, CUDNN_ATTR_OPERATION_MATMUL_DESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &matmul_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul, CUDNN_ATTR_OPERATION_MATMUL_ADESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &query_desc));
    CUDNN_CHECK(
        cudnnBackendSetAttribute(op_matmul, CUDNN_ATTR_OPERATION_MATMUL_BDESC,
                                 CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &key_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul, CUDNN_ATTR_OPERATION_MATMUL_CDESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &scores_desc));
    CUDNN_CHECK(cudnnBackendFinalize(op_matmul));

    cudnnBackendDescriptor_t out_matmul_desc;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_MATMUL_DESCRIPTOR,
                                             &out_matmul_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(out_matmul_desc,
                                         CUDNN_ATTR_MATMUL_COMP_TYPE,
                                         CUDNN_TYPE_DATA_TYPE, 1, &comp_type));
    CUDNN_CHECK(cudnnBackendFinalize(out_matmul_desc));
    cudnnBackendDescriptor_t op_matmul_output;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_OPERATION_MATMUL_DESCRIPTOR, &op_matmul_output));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul_output, CUDNN_ATTR_OPERATION_MATMUL_DESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &out_matmul_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul_output, CUDNN_ATTR_OPERATION_MATMUL_ADESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &scores_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul_output, CUDNN_ATTR_OPERATION_MATMUL_BDESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &value_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_matmul_output, CUDNN_ATTR_OPERATION_MATMUL_CDESC,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &output_desc));
    CUDNN_CHECK(cudnnBackendFinalize(op_matmul_output));

    cudnnBackendDescriptor_t op_graph;
    cudnnBackendDescriptor_t ops[] = {op_matmul, op_matmul_output};
    CUDNN_CHECK(cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR, &op_graph));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        op_graph, CUDNN_ATTR_OPERATIONGRAPH_OPS, CUDNN_TYPE_BACKEND_DESCRIPTOR,
        sizeof(ops) / sizeof(ops[0]), ops));
    CUDNN_CHECK(cudnnBackendSetAttribute(op_graph,
                                         CUDNN_ATTR_OPERATIONGRAPH_HANDLE,
                                         CUDNN_TYPE_HANDLE, 1, &handle));
    CUDNN_CHECK(cudnnBackendFinalize(op_graph));

    cudnnBackendDescriptor_t heur_desc;
    CUDNN_CHECK(cudnnBackendCreateDescriptor(
        CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &heur_desc));
    CUDNN_CHECK(cudnnBackendSetAttribute(
        heur_desc, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
        CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &op_graph));
    cudnnBackendHeurMode_t heur_mode = CUDNN_HEUR_MODE_FALLBACK;
    CUDNN_CHECK(cudnnBackendSetAttribute(heur_desc, CUDNN_ATTR_ENGINEHEUR_MODE,
                                         CUDNN_TYPE_HEUR_MODE, 1, &heur_mode));
    CUDNN_CHECK(cudnnBackendFinalize(heur_desc));
    int64_t count = 0;
    // First query with requestedElementCount = 0: only retrieves how many
    // engine configs the heuristic produced.
    CUDNN_CHECK(cudnnBackendGetAttribute(
        heur_desc, CUDNN_ATTR_ENGINEHEUR_RESULTS, CUDNN_TYPE_BACKEND_DESCRIPTOR,
        0, &count, NULL));
    std::vector<cudnnBackendDescriptor_t> eng_cfgs(count);
    for (cudnnBackendDescriptor_t &cfg : eng_cfgs) {
        CUDNN_CHECK(cudnnBackendCreateDescriptor(
            CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &cfg));
    }
    // Second query: fill the pre-created ENGINECFG descriptors with the
    // actual configs returned by the heuristic.
    CUDNN_CHECK(cudnnBackendGetAttribute(
        heur_desc, CUDNN_ATTR_ENGINEHEUR_RESULTS, CUDNN_TYPE_BACKEND_DESCRIPTOR,
        count, nullptr, eng_cfgs.data()));

    for (cudnnBackendDescriptor_t &cfg : eng_cfgs) {
        cudnnBackendDescriptor_t exec_plan;
        CUDNN_CHECK(cudnnBackendCreateDescriptor(
            CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &exec_plan));
        CUDNN_CHECK(cudnnBackendSetAttribute(
            exec_plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
            CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &cfg));
        CUDNN_CHECK(cudnnBackendSetAttribute(exec_plan,
                                             CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
                                             CUDNN_TYPE_HANDLE, 1, &handle));
        cudnnStatus_t status = cudnnBackendFinalize(exec_plan);
        std::cout << cudnnGetErrorString(status) << "\n";
        if (status == CUDNN_STATUS_SUCCESS) {
            std::cout << "success\n";
        }
    }

    // To be filled in

    return 0;
}

Hello,

What kind of device do you have?
Maybe you should check the log or read the last cuDNN error.

There will probably be a warning like this:

i!     engineConfig: type=json; val={"engineId":7,"smVersion":890,"knobChoices":{"CUDNN_KNOB_TYPE_SPLIT_K_SLC":-1,"CUDNN_KNOB_TYPE_KERNEL_CFG":25}};
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: Multi-GEMM is only supported on Hopper at: receive_op_->target != fort_device::SM90
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: check_matmul_support_fort(node)
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: check_node_support_fort(node_ptr)
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: check_for_support()
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: (CUDNN_STATUS_NOT_INITIALIZED == status) || (CUDNN_STATUS_NOT_SUPPORTED == ((status) / 1000 * 1000))
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: ptr->isSupported()
i!     Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: finalize_internal()

Note:

Reason: Multi-GEMM is only supported on Hopper at: receive_op_->target != fort_device::SM90

Hope this helps.

Thank you for your reply.

I am using cuDNN 9.1.1. I’ve tried running the program with logging enabled to capture error messages, but it’s not providing much additional information.

When logging only errors and warnings, nothing is reported. Logging everything doesn’t show anything irregular about the function calls. With CUDNN_HEUR_MODE_FALLBACK as the heuristic mode, none of the execution plans built from the returned engine configurations finalize successfully. With CUDNN_HEUR_MODE_A, zero engine configurations are returned. I tried mode A because it is reported to support patterns compatible with the runtime fusion engine.

When CUDNN_STATUS_NOT_SUPPORTED is returned during finalization of the execution plans, cudnnGetLastErrorString returns an empty string.

Is there another way to gain insight into why the execution plans are failing to finalize?

The execution plan fails to finalize because finalizing the engine config fails.

Is there another way to gain insight into why the execution plans are failing to finalize?

Yes, with logging.

Here is a link to the documentation showing how to use logging: Troubleshooting — NVIDIA cuDNN

Set the log level to 3 and send the log to a file, e.g. “CUDNN_LOGDEST_DBG=filename.txt” (if you use system variables, a system restart may be required). The log will show in more detail why the engine config could not be created (see my last post).
When the heuristic fails to finalize an engine config, there is no error, only an informational message, which is why log level 3 is required.
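
For example, a minimal way to set this up from inside the program rather than the shell (a sketch assuming a POSIX system; the variables must be set before the first cuDNN call, since cuDNN reads them when it initializes):

#include <cstdlib>  // setenv

int main() {
    // Log level 3 = errors, warnings, and informational messages.
    setenv("CUDNN_LOGLEVEL_DBG", "3", /*overwrite=*/1);
    // Redirect the log from stderr to a file.
    setenv("CUDNN_LOGDEST_DBG", "filename.txt", /*overwrite=*/1);
    // ... create the cuDNN handle and build the graph after this point ...
    return 0;
}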

However, given your test setup (two matmuls in a single graph), I’d guess you are running into this problem because your device does not support Multi-GEMM (multiple matmul operations in a single graph). If you do not have a Hopper device (i.e. H100 or H200), this is not supported (CUDNN_STATUS_NOT_SUPPORTED).

Try creating a graph with multiple pointwise operations; that should work. A minimal sketch follows.
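
Here is a sketch of one such pointwise node (ReLU chosen arbitrarily; x_desc and y_desc are assumed to be tensor descriptors built like the others in your post), which you would then put into an operation graph the same way as your matmul ops:

cudnnBackendDescriptor_t pw_desc;
CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_POINTWISE_DESCRIPTOR,
                                         &pw_desc));
cudnnPointwiseMode_t pw_mode = CUDNN_POINTWISE_RELU_FWD;
CUDNN_CHECK(cudnnBackendSetAttribute(pw_desc, CUDNN_ATTR_POINTWISE_MODE,
                                     CUDNN_TYPE_POINTWISE_MODE, 1, &pw_mode));
CUDNN_CHECK(cudnnBackendFinalize(pw_desc));

cudnnBackendDescriptor_t op_pw;
CUDNN_CHECK(cudnnBackendCreateDescriptor(
    CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR, &op_pw));
CUDNN_CHECK(cudnnBackendSetAttribute(
    op_pw, CUDNN_ATTR_OPERATION_POINTWISE_PW_DESCRIPTOR,
    CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &pw_desc));
CUDNN_CHECK(cudnnBackendSetAttribute(op_pw, CUDNN_ATTR_OPERATION_POINTWISE_XDESC,
                                     CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &x_desc));
CUDNN_CHECK(cudnnBackendSetAttribute(op_pw, CUDNN_ATTR_OPERATION_POINTWISE_YDESC,
                                     CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &y_desc));
CUDNN_CHECK(cudnnBackendFinalize(op_pw));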

I followed your instructions using log level 3, but I don’t see any information about why the heuristic is failing to finalize the engine configuration. I have pasted what I believe are the relevant informational logs. These calls cover the creation, setup, finalization, and count retrieval for the heuristic using mode A:

I! CuDNN (v90101 17) function cudnnBackendCreateDescriptor() called:
i!     descriptorType: type=cudnnBackendDescriptorType_t; val=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR (4);
i! Time: 2025-01-03T13:17:30.734930 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90101 17) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH (201);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR; val=0x1e5c2f0;
i!         : type=bool; val=true;
i! Time: 2025-01-03T13:17:30.734951 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90101 17) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_MODE (200);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_HEUR_MODE (8);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=int; val=3;
i! Time: 2025-01-03T13:17:30.734974 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90101 17) function cudnnBackendFinalize() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i! Time: 2025-01-03T13:17:30.734980 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.


I! CuDNN (v90101 17) function cudnnBackendGetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_ENGINEHEUR_RESULTS (202);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     requestedElementCount: type=int64_t; val=0;
i!     elementCount: location=host; addr=0x7ffd79a1bd98;
i!     arrayOfElements: location=host; addr=NULL_PTR;
i! Time: 2025-01-03T13:17:30.735015 (0d+0h+0m+0s since start)
i! Process=3026944; Thread=3026944; GPU=NULL; Handle=NULL; StreamId=NULL.

It seems like my error messages regarding engine configurations all appear when finalizing the execution plan. I am using an L40S.

When finalizing execution plans with the five engine configurations returned by CUDNN_HEUR_MODE_FALLBACK, warnings are logged with these reasons for the failure to finalize the execution plan:

  • CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH
  • CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH
  • CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: is_multiple_gemms_acc_fusion
  • CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: Fused Attention patterns are not supported by this engine at: is_fmha
  • CUDNN_STATUS_NOT_SUPPORTED_ARCH_MISMATCH

I also get one engine configuration back when using CUDNN_HEUR_MODE_A, which fails with:

  • CUDNN_STATUS_NOT_SUPPORTED_SHARED_MEMORY_INSUFFICIENT

However, cudaMemGetInfo reports that I have 47 GB free. From my understanding of the heuristic modes, if it is possible to execute the graph at all, FALLBACK should return an engine configuration that works, meaning the Fused Attention fprop graph pattern isn’t supported on my device, as you stated. I just wanted to be able to reproduce these reasons with my program.

In order to support MHA fprop in the general case, would I have to create two operation graphs with separate engine configs and execution plans: one for scaling->matmul and another for softmax->matmul?

OK, according to this site, NVIDIA L40S GPU, the L40S is the Ada Lovelace architecture.

Since you are using cuDNN 9.1.1, that could explain why you get different error messages. Newer versions of cuDNN have improved error handling and logging. For example, version 9.3: "Error messages generated during retrieval of the CUDNN_ATTR_ENGINEHEUR_RESULTS attribute are accessible through the cudnnGetLastErrorString() function." from Release Notes — NVIDIA cuDNN
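
On a cuDNN version with that improvement, reading the message back after a failed finalize could look roughly like this (a sketch reusing the exec_plan from your loop; buffer size chosen arbitrarily):

cudnnStatus_t status = cudnnBackendFinalize(exec_plan);
if (status != CUDNN_STATUS_SUCCESS) {
    char msg[1024] = {0};
    cudnnGetLastErrorString(msg, sizeof(msg));  // last error/warning text
    std::cerr << "finalize failed: " << cudnnGetErrorString(status)
              << " -- " << msg << "\n";
}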

I am currently at that point myself, and my best guess is: yes. Maybe you want to join this discussion: Conceptual questions about how to use cuDNN v9

Yes, I will join that discussion. Thank you so much; you have been very helpful.