CUDNN_ATTR_TENSOR_IS_BY_VALUE

In cudnn_graph Library — NVIDIA cuDNN Backend
The API for CUDNN_ATTR_OPERATION_NORM_FWD_EPSILON_DESC states

Scalar input tensor descriptor for the epsilon value used in
normalization calculation. Note that the attribute
CUDNN_ATTR_TENSOR_IS_BY_VALUE of this descriptor should be set to
true.

  1. How do I set the CUDNN_ATTR_TENSOR_IS_BY_VALUE attribute to true?
    I.e. what do I pass for … to

    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_IS_BY_VALUE,
    …,
    1,
    …);

  2. What is a “scalar tensor”?
    I.e. what should nDims, epsilonDim, epsilonStr, and alignment be
    in the following:

    cudnnBackendDescriptor_t epsilonDesc;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &epsilonDesc);
    cudnnDataType_t dtype = CUDNN_DATA_FLOAT;
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_DATA_TYPE,
    CUDNN_TYPE_DATA_TYPE, 1, &dtype);
    int64_t nDims = 4;
    int64_t epsilonDim[4] = {1, 1, 1, 1};
    int64_t epsilonStr[4] = {1, 1, 1, 1};
    int64_t epsilonUi = 'e';
    int64_t alignment = 16;
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_DIMENSIONS,
    CUDNN_TYPE_INT64, nDims, epsilonDim);
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_STRIDES,
    CUDNN_TYPE_INT64, nDims, epsilonStr);
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_UNIQUE_ID,
    CUDNN_TYPE_INT64, 1, &epsilonUi);
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
    CUDNN_TYPE_INT64, 1, &alignment);

Reposting with the code formatted.

In the description of CUDNN_ATTR_OPERATION_NORM_FWD_EPSILON_DESC in

https://docs.nvidia.com/deeplearning/cudnn/backend/latest/api/cudnn-graph-library.html#cudnn-graph-library

it says:

Scalar input tensor descriptor for the epsilon value used in
normalization calculation. Note that the attribute
CUDNN_ATTR_TENSOR_IS_BY_VALUE of this descriptor should be set to
true.

How does one set the CUDNN_ATTR_TENSOR_IS_BY_VALUE attribute to true?
What values do I pass in for TYPE and VALUE in the following:

   CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_IS_BY_VALUE,
                                        TYPE, 1, VALUE));

More generally, what is a “scalar tensor”? How does one create one?
What should nDims, xDim, and XStr be in the following:

  cudnnBackendDescriptor_t xDesc;
  CUDNN_CHECK(cudnnBackendCreateDescriptor
	      (CUDNN_BACKEND_TENSOR_DESCRIPTOR, &xDesc));
  cudnnDataType_t dtype = CUDNN_DATA_FLOAT;
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_DATA_TYPE,
				       CUDNN_TYPE_DATA_TYPE, 1, &dtype));
  int64_t nDims = 4;
  int64_t xDim[4] = {1, 1, 1, 1};
  int64_t xStr[4] = {1, 1, 1, 1};
  int64_t xUi = 'e';
  int64_t alignment = 16;
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_DIMENSIONS,
				       CUDNN_TYPE_INT64, nDims, xDim));
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_STRIDES,
				       CUDNN_TYPE_INT64, nDims, xStr));
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_UNIQUE_ID,
				       CUDNN_TYPE_INT64, 1, &xUi));
  CUDNN_CHECK(cudnnBackendSetAttribute
	      (xDesc, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
	       CUDNN_TYPE_INT64, 1, &alignment));
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_IS_BY_VALUE,
				       CUDNN_TYPE_INT64, 1, &xUi));
  CUDNN_CHECK(cudnnBackendFinalize(xDesc));

Also, I get the following when trying to set up layer normalization:

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_MODE (2000);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_NORM_MODE (27);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=int; val=0;
i! Time: 2025-04-28T21:27:51.992003 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_PHASE (2001);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_NORM_FWD_PHASE (28);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=int; val=1;
i! Time: 2025-04-28T21:27:51.992044 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_XDESC (2002);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x560694020c00;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992086 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_MEAN_DESC (2003);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x56069533cbc0;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992138 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_INV_VARIANCE_DESC (2004);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cbe20;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992183 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_SCALE_DESC (2005);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cbf50;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992228 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_BIAS_DESC (2006);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cc080;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992274 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_EPSILON_DESC (2007);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cc1b0;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992319 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_YDESC (2013);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cc2e0;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992360 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendFinalize() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i! Time: 2025-04-28T21:27:51.992413 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

Does this mean that layer normalization is not implemented?

Hi @qobi

LayerNorm is supported in cuDNN. To make it, and other prominent operations, accessible, we have a C++ front-end wrapper to the backend available, along with sample code. This should be easier to use. Please check it out here: cudnn-frontend/samples/cpp/norm at main · NVIDIA/cudnn-frontend · GitHub . We are happy to answer backend API questions if for some reason the front-end wrapper won’t work for your use case.

Best,

Sophie

Thanks. I need to integrate layer normalization into a large program that is written in C, not C++. That is why I am using the backend, not the frontend wrapper. It would be helpful to know how to get the backend code working.

I enclose a snippet of my code. I basically took it from the NVIDIA web page documentation for cuDNN.

Thanks,
Jeff

graph-api-example.txt (18.8 KB)

Hi Jeff,

Does the following example code-snippet unblock you?

bool xByValue = true;
CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_IS_BY_VALUE, CUDNN_TYPE_BOOLEAN, 1, &xByValue));

– Seth

Thanks. That helps.

As an aside, C doesn’t have the type bool and the constant true.
So I did:

#define TRUE (0==0)
int64_t epsilonByValue = TRUE;
CUDNN_CHECK(cudnnBackendSetAttribute(epsilonDesc, CUDNN_ATTR_TENSOR_IS_BY_VALUE,
                                     CUDNN_TYPE_BOOLEAN, 1, &epsilonByValue));

I presume this is still correct.

Now I get:

qobi@poto>./graph-api-example
before op_graph

W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: !patternMatchOptionalLinearDAG(this->getEntranceNodes().front(), userGraph->getEntranceNodes().front())
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: referenceGraph->patternMatchLinearDAG(userGraph)
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: LinearPatternMatcher::matchPattern(userGraph)
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: matchPattern(opg.getGraph())
w! Time: 2025-04-30T20:06:05.217680 (0d+0h+0m+4s since start)
w! Process=1098065; Thread=1098065; GPU=NULL; Handle=NULL; StreamId=NULL.

after op_graph
before engcfg

E! CuDNN (v90800 87) function cudnnBackendFinalize() called:
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: fillup_num_execution_groups()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: status
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: ptr->isSupported()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: finalize_internal()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: ptrDesc->finalize()
e! Time: 2025-04-30T20:06:05.217805 (0d+0h+0m+4s since start)
e! Process=1098065; Thread=1098065; GPU=NULL; Handle=NULL; StreamId=NULL.

Failed: cuDNN error graph-api-example.c:319 'CUDNN_STATUS_EXECUTION_FAILED'
qobi@poto>

I presume this means that my constructed graph doesn’t match one of
the standard patterns that it has a kernel for. But my code only has
one node which is a layer normalization. I don’t know what aspect of
this doesn’t match the known kernels. If you look at the top of the
file you see what my understanding of the requirements are by reading
an old version of the docs. I can’t find the equivalent in the new
version of the docs.

The current code is enclosed.

graph-api-example.txt (18.9 KB)

What do I need to change?

Thanks,
Jeff

I think the obstacles are:

  • You need to set gidx to 1 (more on that in a moment).
  • Set mean, inv. var descriptors on the bwds operation.
  • Allow more than 0 bytes workspace (128 worked locally for me)

If you modify like the above, then both cudnn_ln and cudnn_dlndx12b should return “success”. I did not modify things further to execute.

A few things:

  • The performance won’t be what you want as-is: you should be querying heuristics for your workloads (HEUR_MODE_A). This will allow you to move beyond hard-coding gidx as above (which should not be done generally).
  • You’ll eventually need to get the workspace from the finalized plan, allocate it, and pass it back (as for gidx, hardcoding is not recommended).
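Getting the workspace from the finalized plan can be sketched roughly as follows. This is a hedged example, not the thread's actual code: `plan` is assumed to be an already-finalized CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, `varpack` a not-yet-finalized variant pack, and CUDNN_CHECK a caller-supplied error-checking macro.

```c
/* Query how much device workspace the finalized plan needs,
   allocate it, and hand it to the variant pack. */
int64_t ws_size = 0;
CUDNN_CHECK(cudnnBackendGetAttribute(plan,
    CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE,
    CUDNN_TYPE_INT64, 1, NULL, &ws_size));
void *workspace = NULL;
if (ws_size > 0) cudaMalloc(&workspace, (size_t)ws_size);
CUDNN_CHECK(cudnnBackendSetAttribute(varpack,
    CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
    CUDNN_TYPE_VOID_PTR, 1, &workspace));
```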

The overall API sequence of the c++ front-end wrapper is probably what you want here. I know you can’t use it directly, but it’s fundamentally just a wrapper – all the backend calls used here are used by it also.

I hope this helps & unblocks you, but if not, update here, and we’ll try to help.

Thanks for your help. I got the forward pass to work. I will work on the
backward pass later this afternoon. Several more questions:

  1. cudnn_graph Library — NVIDIA cuDNN Backend
    Setting Up An Engine Configuration
    has example code snippet 1 which sets gidx to 0 which is why I did that.
    Why does it need to be 1 instead of 0?
    Is there documentation that specifies what the possible values are
    and what they mean?
    If the docs are wrong, perhaps change them.

  2. cudnn_graph Library — NVIDIA cuDNN Backend
    CUDNN_ATTR_OPERATION_NORM_BWD_MEAN_DESC
    and
    CUDNN_ATTR_OPERATION_NORM_BWD_INV_VARIANCE_DESC
    both say that they are Optional attributes. This is why I did not
    include them. In the old API for batch normalization
    (cudnnBatchNormalizationBackward) these were optional and just
    made things faster (at the expense of some more memory). Are these
    required for layer normalization with the graph API? If so,
    perhaps change the docs.

Thanks,
Jeff

Thanks. I have several more questions. All my questions refer to the
documentation at:

cudnn_graph Library — NVIDIA cuDNN Backend

  1. The docs say CUDNN_HEUR_MODE_A and CUDNN_HEUR_MODE_B each support
    a specific list of operation nodes or graphs. Neither of them lists
    layer normalization. You suggested that I use the heuristic modes.
    What do I do?

  2. cudnnBackendGetAttribute of CUDNN_ATTR_ENGINEHEUR_RESULTS on a
    CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR returns a
    CUDNN_BACKEND_ENGINECFG_DESCRIPTOR. Do I first have to create the
    CUDNN_BACKEND_ENGINECFG_DESCRIPTOR with cudnnBackendCreateDescriptor,
    or does the CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR do that for me? I.e.
    after doing:

  CUDNN_CHECK(cudnnBackendCreateDescriptor
    (CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &engheur));
  CUDNN_CHECK(cudnnBackendSetAttribute
    (engheur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
     CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &op_graph));
  CUDNN_CHECK(cudnnBackendSetAttribute
    (engheur, CUDNN_ATTR_ENGINEHEUR_MODE,
     CUDNN_TYPE_HEUR_MODE, 1, &heur_mode));
  CUDNN_CHECK(cudnnBackendFinalize(engheur));

should I do:

 CUDNN_CHECK(cudnnBackendCreateDescriptor
    (CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engcfg));

before I do:

  CUDNN_CHECK(cudnnBackendGetAttribute
    (engheur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
   CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, NULL, &engcfg));
  CUDNN_CHECK(cudnnBackendSetAttribute
    (engcfg, CUDNN_ATTR_ENGINECFG_ENGINE,
   CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engine));
  CUDNN_CHECK(cudnnBackendFinalize(engcfg));

Do I have to? The docs don’t say.

  3. Should I do:
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(engcfg));

on the CUDNN_BACKEND_ENGINECFG_DESCRIPTOR returned by
cudnnBackendGetAttribute of CUDNN_ATTR_ENGINEHEUR_RESULTS on a
CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR? Do I have to? The docs don’t
say.

  4. If I run the enclosed graph-api-example-v2 I get:
  before op_graph

  W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: !patternMatchOptionalLinearDAG(this->getEntranceNodes().front(), userGraph->getEntranceNodes().front())
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: referenceGraph->patternMatchLinearDAG(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: LinearPatternMatcher::matchPattern(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: matchPattern(opg.getGraph())
  w! Time: 2025-05-02T07:47:53.837436 (0d+0h+0m+3s since start)
  w! Process=1363821; Thread=1363821; GPU=NULL; Handle=NULL; StreamId=NULL.

  after op_graph
  cudnn_ln success
  before op_graph

  W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: !patternMatchOptionalLinearDAG(this->getEntranceNodes().front(), userGraph->getEntranceNodes().front())
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: referenceGraph->patternMatchLinearDAG(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: LinearPatternMatcher::matchPattern(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: matchPattern(opg.getGraph())
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: Non-pointwise ops before the forward norm op are not supported at: node_ptr->opType != CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
  w! Time: 2025-05-02T07:47:53.987149 (0d+0h+0m+3s since start)
  w! Process=1363821; Thread=1363821; GPU=NULL; Handle=NULL; StreamId=NULL.

  after op_graph
  cudnn_dlndx12b success

Why?

  5. If I run the enclosed graph-api-example-v3 I get:
  before op_graph

  W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: !patternMatchOptionalLinearDAG(this->getEntranceNodes().front(), userGraph->getEntranceNodes().front())
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: referenceGraph->patternMatchLinearDAG(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: LinearPatternMatcher::matchPattern(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: matchPattern(opg.getGraph())
  w! Time: 2025-05-02T07:49:35.236680 (0d+0h+0m+4s since start)
  w! Process=1364076; Thread=1364076; GPU=NULL; Handle=NULL; StreamId=NULL.

  after op_graph
  before engine

  W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: The specified engine with global index -1 is not found for NORM_FWD_TRAIN(18) at: CUDNN_INVALID_PLACE_HOLDER_ENGINE == name
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: finalize_internal()
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: ptrDesc->finalize()
  w! Time: 2025-05-02T07:49:35.236728 (0d+0h+0m+4s since start)
  w! Process=1364076; Thread=1364076; GPU=NULL; Handle=NULL; StreamId=NULL.

  after engine
  before engine heuristic
  after engine heuristic

  E! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
  e!         Error: CUDNN_STATUS_BAD_PARAM; Reason: ptrDesc->set(attributeName, attributeType, elementCount, arrayOfElements)
  e! Time: 2025-05-02T07:49:35.246351 (0d+0h+0m+4s since start)
  e! Process=1364076; Thread=1364076; GPU=NULL; Handle=NULL; StreamId=NULL.

  Failed: cuDNN error graph-api-example.c:355 'CUDNN_STATUS_BAD_PARAM'

Why?

  6. Thanks for the tip about getting the requisite workspace size and
    allocating it. I understand that. It is just that this is part of
    a big system that has a common mechanism elsewhere for determining
    how much workspace to allocate for the whole program and
    allocating that once at the beginning. So to simplify things, I
    didn’t include that in my minimal example.

  7. Re: “The overall API sequence of the c++ front-end wrapper is
    probably what you want here. I know you can’t use it directly, but
    it’s fundamentally just a wrapper – all the backend calls used
    here are used by it also.”

    I can’t determine from this code how to query the heuristics as
    per above as all it does is:

  REQUIRE(graph.create_execution_plans({fe::HeurMode_t::A}).is_good());

or

  REQUIRE(graph.create_execution_plans({fe::HeurMode_t::FALLBACK}).is_good());

Thanks,
Jeff
grap-api-example-v1.txt (20.2 KB)
grap-api-example-v2.txt (19.7 KB)
grap-api-example-v3.txt (21.2 KB)

Hi Jeff,

At least for the v3 code example, you needn’t set the engine on the engine-config (the engine-config already has the engine), and you don’t need to finalize the engine-configs returned from the heuristic (they should already be ready to use). Those two minor changes should unblock you (i.e., two API call deletions and it should work).
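The corrected sequence might look something like the sketch below. This is a hedged reconstruction against the variable names used earlier in the thread (`engheur`, `engcfg`, `plan`, `cudnn_handle`), not the actual fixed program: the engine config returned by the heuristic goes straight into the execution plan.

```c
/* Sketch: consume an engine config directly from the heuristic.
   Note the absence of cudnnBackendSetAttribute(engcfg,
   CUDNN_ATTR_ENGINECFG_ENGINE, ...) and cudnnBackendFinalize(engcfg):
   the heuristic result is already bound to an engine and finalized. */
int64_t returned = 0;
CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engcfg));
CUDNN_CHECK(cudnnBackendGetAttribute(engheur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
    CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &returned, &engcfg));
CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan));
CUDNN_CHECK(cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
    CUDNN_TYPE_HANDLE, 1, &cudnn_handle));
CUDNN_CHECK(cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
    CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engcfg));
CUDNN_CHECK(cudnnBackendFinalize(plan));
```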

If the docs suggest that HEUR_MODE_A doesn’t support norm., then that’s wrong, and those will be corrected. For collecting heuristic results from _MODE_A, the descriptors need to be allocated, created, and once done with them, de-allocated (and if you are collecting many such results, contiguously allocated).

Let me know if you’d like to get v2 working, as I didn’t look at that.

– Seth

Thanks for your help.

I commented out the two API calls I think you are referring to in
graph-api-example-v4. Now i get the enclosed error. What do I need to
do to fix?

Is my call to cudnnBackendCreateDescriptor in
cudnn_engine_configuration_descriptor in the correct place,
i.e. before

  CUDNN_CHECK(cudnnBackendGetAttribute
	      (engheur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
	       CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, NULL, &engcfg));

Does it matter what order the descriptors are freed?

  CUDNN_CHECK(cudnnBackendDestroyDescriptor(varpack));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(plan));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(engcfg));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(engheur));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(engine));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(op_graph));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(fprop));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(yDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(epsilonDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(betaDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(gammaDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(invVarianceDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(meanDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(xDesc));

graph-api-example-v4.txt (21.2 KB)
graph-api-example-v4.-log.txt (11.8 KB)

Thanks,
Jeff

The “CUDNN_STATUS_NOT_INITIALIZED” error is due to the _MODE_A heuristics result not supporting the given workload (and it tells you that because the engine-config is not finalized in that case; a slight correction to what I stated above). This can happen, and when it does for the full set of requested results (you can request more than 1), you’ll need to use the _FALLBACK mode to collect engine configurations. Not all workloads are supported by the faster engine configurations (we try, but there are edge cases), and when that happens, the _FALLBACK configurations are the next alternative.

For example, changing your program to a larger tensor worked with _MODE_A. If you can share a workload, I can set the dimensions / strides to correspond to that if needed.

For the last two questions:

  1. it looked ok; bear in mind that it may be better to poll for more than one heuristic result.
  2. destroying in reverse order as creation should be safe.

Thanks,

Three more questions:

  1. Is there a way to know the maximum number of engine configurations that can be returned by a call to cudnnBackendGetAttribute with CUDNN_ATTR_ENGINEHEUR_RESULTS, so that I can allocate enough space to get them all? Is there some MAX_ENGINEHEUR_RESULTS constant defined somewhere?
  2. Is there a way to tell if an engine configuration that is returned by a call to cudnnBackendGetAttribute with CUDNN_ATTR_ENGINEHEUR_RESULTS is finalized?
  3. I presume that the way to select the best one is to go down the array and pick the first one that is finalized. Is that correct?

It would be great if all of this were documented.

OK. I think I understand things better now.

The key is to call

  CUDNN_CHECK(cudnnBackendGetAttribute
	      (engheur,
	       CUDNN_ATTR_ENGINEHEUR_RESULTS,
	       CUDNN_TYPE_BACKEND_DESCRIPTOR,
	       MAX_NUMBER_OF_ENGINE_CONFIGURATIONS,
	       &number_of_engine_configurations,
	       &engcfg));

with &number_of_engine_configurations instead of NULL to get the
number of engine configurations returned. If it is zero, then there
are none with that heuristic. If it is nonzero, all returned engine
configuration will be finalized. So I don’t have to check whether they
are finalized.

Is my understanding correct?

Is there ever any need to request more than 1, i.e. to set
MAX_NUMBER_OF_ENGINE_CONFIGURATIONS to a number greater than 1?
If so, how do I choose among them?

Your understanding is correct.

Regarding what’s returned by the heuristic: while it is reliable, we encourage users to autotune to find the best-performing engine config for their workload. The norm engines are runtime compiled (per their behavior note), so once you have an engine config you are happy with, you should keep the corresponding plan alive until that workload is retired, rather than putting the full API flow (as above) in some critical path (because the runtime kernel compilation might be slow).
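Requesting several results and taking the first one that yields a finalized plan can be sketched as below. This is a hedged example using the thread's variable names (`engheur`, `plan`, `cudnn_handle`) plus a hypothetical MAX_CFGS; each engcfgs[i] must be created before the query, and a proper autotuner would time each candidate plan rather than stop at the first.

```c
/* Sketch: poll the heuristic for up to MAX_CFGS configs, then keep the
   first config whose execution plan finalizes successfully. */
#define MAX_CFGS 8
cudnnBackendDescriptor_t engcfgs[MAX_CFGS];
int64_t returned = 0;
for (int i = 0; i < MAX_CFGS; i++)
  CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR,
                                           &engcfgs[i]));
CUDNN_CHECK(cudnnBackendGetAttribute(engheur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
    CUDNN_TYPE_BACKEND_DESCRIPTOR, MAX_CFGS, &returned, engcfgs));
for (int64_t i = 0; i < returned; i++) {
  CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
                                           &plan));
  CUDNN_CHECK(cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
      CUDNN_TYPE_HANDLE, 1, &cudnn_handle));
  CUDNN_CHECK(cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
      CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engcfgs[i]));
  if (cudnnBackendFinalize(plan) == CUDNN_STATUS_SUCCESS) break;  /* candidate works */
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(plan));               /* try the next one */
}
```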

Now I have tried to put in actual values and I get a segfault in

CUDNN_CHECK(cudnnBackendExecute(cudnn_handle, plan, varpack));

(gdb shows that it is in that call.)

If I remove that, it runs to completion.

How do I fix?

graph-api-example-v8.txt (27.4 KB)

For epsilon, since it’s set by value, it should be a host pointer (not a GPU one). I can get _v8 to run successfully (on an Ampere arch. card) with a tiny tweak:

void *dev_ptrs[7] = {
    inputs_gpu,
    mean_gpu,
    inv_variance_gpu,
    gamma_gpu,
    beta_gpu,
    epsilon,      /* host pointer, since epsilon is passed by value */
    outputs_gpu
};

This corrected the segfault on my end, and the program terminated normally.

I have cuDNN layer normalization working in my code. I have also constructed a plan cache so I can reuse plans. I have two questions about this. Please see the enclosed code snippet.

  1. When I save the plan, can I destroy all of the other descriptors? Or does the plan reference these descriptors? See the comment “Can I do these here?” in the code snippet.
  2. Does creating the varpack take significant time? Should I cache that too? If I do, and I cache the dev_ptrs array, can I simply assign new pointers to that array and then just call cudnnBackendExecute? See the comment “Should I also cache the varpack?”. Can I change the FALSE to TRUE?

Also, I point out that the graph API for cuDNN is misdesigned. If you have a network API (i.e. torch.nn) you can store the plan on vertices of the network. But if you have a functional API (i.e. torch.nn.functional) there is no place to store the plan. So you need a cache like I have that is indexed by the sizes of the arguments. That could make plan cache lookup slow.

This whole thing would have been simpler with the legacy API had you exposed an ability to just call the predefined kernels without having to construct a graph and do graph matching.

snippet.txt (4.9 KB)

Hi Jeff,

  1. I see that in the code snippet you are destroying the plan descriptor after caching it. Note that a cudnnBackendDescriptor_t is a typedef’d void pointer to an opaque descriptor structure, and by caching the plan descriptor you are doing a shallow copy of the underlying structure. Hence, you should definitely not destroy the plan descriptor. Among the other descriptors, you should be okay to destroy all of them besides the op_graph descriptor.

  2. Creating the varPack should not take significant time. Assigning new pointers to the dev_ptrs array will not propagate them to the varPack; you will need to set the CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS attribute for that purpose. However, setting an attribute on a finalized varPack descriptor is not permitted, so you would have to create and finalize a new varPack descriptor every time before calling cudnnBackendExecute(). In other words, you cannot cache the varPack.
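A minimal sketch of what that implies for a plan cache follows. This is a hypothetical helper, not code from the thread: the cached, finalized plan is reused across calls, while a fresh variant pack is built per call from the current device pointers.

```c
/* Execute a cached plan with the current data pointers.
   uids must match the CUDNN_ATTR_TENSOR_UNIQUE_IDs used when the
   graph (and hence the plan) was built; n is the tensor count. */
static void execute_cached_plan(cudnnHandle_t handle,
                                cudnnBackendDescriptor_t plan,
                                int64_t *uids, void **ptrs, int64_t n,
                                void *workspace) {
  cudnnBackendDescriptor_t varpack;
  CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
                                           &varpack));
  CUDNN_CHECK(cudnnBackendSetAttribute(varpack, CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
      CUDNN_TYPE_INT64, n, uids));
  CUDNN_CHECK(cudnnBackendSetAttribute(varpack, CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
      CUDNN_TYPE_VOID_PTR, n, ptrs));
  CUDNN_CHECK(cudnnBackendSetAttribute(varpack, CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
      CUDNN_TYPE_VOID_PTR, 1, &workspace));
  CUDNN_CHECK(cudnnBackendFinalize(varpack));
  CUDNN_CHECK(cudnnBackendExecute(handle, plan, varpack));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(varpack));  /* cheap; not cached */
}
```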

~ Rajas