CUDNN_ATTR_TENSOR_IS_BY_VALUE

In cudnn_graph Library — NVIDIA cuDNN Backend
The API for CUDNN_ATTR_OPERATION_NORM_FWD_EPSILON_DESC states

Scalar input tensor descriptor for the epsilon value used in
normalization calculation. Note that the attribute
CUDNN_ATTR_TENSOR_IS_BY_VALUE of this descriptor should be set to
true.

  1. How do I set the CUDNN_ATTR_TENSOR_IS_BY_VALUE attribute to true?
    I.e. what do I pass for … to

    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_IS_BY_VALUE,
    …,
    1,
    …);

  2. What is a “scalar tensor”?
    I.e. what should nDims, epsilonDim, epsilonStr, and alignment be
    in the following:

    cudnnBackendDescriptor_t epsilonDesc;
    cudnnBackendCreateDescriptor(CUDNN_BACKEND_TENSOR_DESCRIPTOR, &epsilonDesc);
    cudnnDataType_t dtype = CUDNN_DATA_FLOAT;
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_DATA_TYPE,
    CUDNN_TYPE_DATA_TYPE, 1, &dtype);
    int64_t nDims = 4;
    int64_t epsilonDim[4] = {1, 1, 1, 1};
    int64_t epsilonStr[4] = {1, 1, 1, 1};
    int64_t epsilonUi = 'e';
    int64_t alignment = 16;
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_DIMENSIONS,
    CUDNN_TYPE_INT64, nDims, epsilonDim);
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_STRIDES,
    CUDNN_TYPE_INT64, nDims, epsilonStr);
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_UNIQUE_ID,
    CUDNN_TYPE_INT64, 1, &epsilonUi);
    cudnnBackendSetAttribute(epsilonDesc,
    CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
    CUDNN_TYPE_INT64, 1, &alignment);

Reposting with the code formatted.

In the description of CUDNN_ATTR_OPERATION_NORM_FWD_EPSILON_DESC in

https://docs.nvidia.com/deeplearning/cudnn/backend/latest/api/cudnn-graph-library.html#cudnn-graph-library

it says:

Scalar input tensor descriptor for the epsilon value used in
normalization calculation. Note that the attribute
CUDNN_ATTR_TENSOR_IS_BY_VALUE of this descriptor should be set to
true.

How does one set the CUDNN_ATTR_TENSOR_IS_BY_VALUE attribute to true?
What values do I pass in for TYPE and VALUE in the following:

   CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_IS_BY_VALUE,
                                        TYPE, 1, VALUE));

More generally, what is a “scalar tensor”? How does one create one?
What should nDims, xDim, and XStr be in the following:

  cudnnBackendDescriptor_t xDesc;
  CUDNN_CHECK(cudnnBackendCreateDescriptor
	      (CUDNN_BACKEND_TENSOR_DESCRIPTOR, &xDesc));
  cudnnDataType_t dtype = CUDNN_DATA_FLOAT;
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_DATA_TYPE,
				       CUDNN_TYPE_DATA_TYPE, 1, &dtype));
  int64_t nDims = 4;
  int64_t xDim[4] = {1, 1, 1, 1};
  int64_t xStr[4] = {1, 1, 1, 1};
  int64_t xUi = 'e';
  int64_t alignment = 16;
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_DIMENSIONS,
				       CUDNN_TYPE_INT64, nDims, xDim));
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_STRIDES,
				       CUDNN_TYPE_INT64, nDims, xStr));
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_UNIQUE_ID,
				       CUDNN_TYPE_INT64, 1, &xUi));
  CUDNN_CHECK(cudnnBackendSetAttribute
	      (xDesc, CUDNN_ATTR_TENSOR_BYTE_ALIGNMENT,
	       CUDNN_TYPE_INT64, 1, &alignment));
  CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_IS_BY_VALUE,
				       CUDNN_TYPE_INT64, 1, &xUi));
  CUDNN_CHECK(cudnnBackendFinalize(xDesc));

Also, I get the following when trying to set up layer normalization:

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_MODE (2000);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_NORM_MODE (27);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=int; val=0;
i! Time: 2025-04-28T21:27:51.992003 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_PHASE (2001);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_NORM_FWD_PHASE (28);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=int; val=1;
i! Time: 2025-04-28T21:27:51.992044 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_XDESC (2002);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x560694020c00;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992086 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_MEAN_DESC (2003);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x56069533cbc0;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992138 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_INV_VARIANCE_DESC (2004);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cbe20;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992183 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_SCALE_DESC (2005);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cbf50;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992228 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_BIAS_DESC (2006);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cc080;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992274 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_EPSILON_DESC (2007);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cc1b0;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992319 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i!     attributeName: type=cudnnBackendAttributeName_t; val=CUDNN_ATTR_OPERATION_NORM_FWD_YDESC (2013);
i!     attributeType: type=cudnnBackendAttributeType_t; val=CUDNN_TYPE_BACKEND_DESCRIPTOR (15);
i!     elementCount: type=int64_t; val=1;
i!     [0]: type=CUDNN_BACKEND_TENSOR_DESCRIPTOR; val=0x5606952cc2e0;
i!         : type=bool; val=true;
i! Time: 2025-04-28T21:27:51.992360 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

I! CuDNN (v90800 87) function cudnnBackendFinalize() called:
i!     descriptor: type=CUDNN_BACKEND_OPERATION_NORM_FORWARD_DESCRIPTOR; val=NOT_IMPLEMENTED;
i! Time: 2025-04-28T21:27:51.992413 (0d+0h+0m+0s since start)
i! Process=761979; Thread=761979; GPU=NULL; Handle=NULL; StreamId=NULL.

Does this mean that layer normalization is not implemented?

Hi @qobi

LayerNorm is supported in cuDNN. To make it, and other prominent operations, accessible, we have a C++ front-end wrapper to the backend available, along with sample code. This should be easier to use. Please check it out here: cudnn-frontend/samples/cpp/norm at main · NVIDIA/cudnn-frontend · GitHub . We are happy to answer backend API questions if for some reason the front-end wrapper won’t work for your use case.

Best,

Sophie

Thanks. I need to integrate layer normalization into a large program that is written in C, not C++. That is why I am using the backend, not the frontend wrapper. It would be helpful to know how to get the backend code working.

I enclose a snippet of my code. I basically took it from the NVIDIA web page documentation for cuDNN.

Thanks,
Jeff

graph-api-example.txt (18.8 KB)

Hi Jeff,

Does the following example code-snippet unblock you?

bool xByValue = true;
CUDNN_CHECK(cudnnBackendSetAttribute(xDesc, CUDNN_ATTR_TENSOR_IS_BY_VALUE, CUDNN_TYPE_BOOLEAN, 1, &xByValue));

– Seth

Thanks. That helps.

As an aside, C doesn’t have the type bool and the constant true.
So I did:

#define TRUE (0==0)
int64_t epsilonByValue = TRUE;
CUDNN_CHECK(cudnnBackendSetAttribute(epsilonDesc, CUDNN_ATTR_TENSOR_IS_BY_VALUE,
                                     CUDNN_TYPE_BOOLEAN, 1, &epsilonByValue));

I presume this is still correct.

Now I get:

qobi@poto>./graph-api-example
before op_graph

W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: !patternMatchOptionalLinearDAG(this->getEntranceNodes().front(), userGraph->getEntranceNodes().front())
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: referenceGraph->patternMatchLinearDAG(userGraph)
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: LinearPatternMatcher::matchPattern(userGraph)
w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: matchPattern(opg.getGraph())
w! Time: 2025-04-30T20:06:05.217680 (0d+0h+0m+4s since start)
w! Process=1098065; Thread=1098065; GPU=NULL; Handle=NULL; StreamId=NULL.

after op_graph
before engcfg

E! CuDNN (v90800 87) function cudnnBackendFinalize() called:
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: fillup_num_execution_groups()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: status
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: ptr->isSupported()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: finalize_internal()
e!         Error: CUDNN_STATUS_EXECUTION_FAILED; Reason: ptrDesc->finalize()
e! Time: 2025-04-30T20:06:05.217805 (0d+0h+0m+4s since start)
e! Process=1098065; Thread=1098065; GPU=NULL; Handle=NULL; StreamId=NULL.

Failed: cuDNN error graph-api-example.c:319 'CUDNN_STATUS_EXECUTION_FAILED'
qobi@poto>

I presume this means that my constructed graph doesn’t match one of
the standard patterns that it has a kernel for. But my code only has
one node which is a layer normalization. I don’t know what aspect of
this doesn’t match the known kernels. If you look at the top of the
file you see what my understanding of the requirements are by reading
an old version of the docs. I can’t find the equivalent in the new
version of the docs.

The current code is enclosed.

graph-api-example.txt (18.9 KB)

What do I need to change?

Thanks,
Jeff

I think the obstacles are:

  • You need to set gidx to 1 (more on that in a moment).
  • Set mean, inv. var descriptors on the bwds operation.
  • Allow more than 0 bytes workspace (128 worked locally for me)

If you modify like the above, then both cudnn_ln and cudnn_dlndx12b should return “success”. I did not modify things further to execute.

A few things:

  • The performance won’t be what you want as-is: you should be querying heuristics for your workloads (HEUR_MODE_A). This will allow you to move beyond hard-coding gidx as above (which should not be done generally).
  • You’ll eventually need to get the workspace from the finalized plan, allocate it, and pass it back (as for gidx, hardcoding is not recommended).
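Getting the workspace from the finalized plan can be sketched roughly as follows. This is a hedged example, not the thread's actual code: `plan` is assumed to be an already-finalized CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, `varpack` a not-yet-finalized variant pack, and CUDNN_CHECK a caller-supplied error-checking macro.

```c
/* Query how much device workspace the finalized plan needs,
   allocate it, and hand it to the variant pack. */
int64_t ws_size = 0;
CUDNN_CHECK(cudnnBackendGetAttribute(plan,
    CUDNN_ATTR_EXECUTION_PLAN_WORKSPACE_SIZE,
    CUDNN_TYPE_INT64, 1, NULL, &ws_size));
void *workspace = NULL;
if (ws_size > 0) cudaMalloc(&workspace, (size_t)ws_size);
CUDNN_CHECK(cudnnBackendSetAttribute(varpack,
    CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
    CUDNN_TYPE_VOID_PTR, 1, &workspace));
```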

The overall API sequence of the c++ front-end wrapper is probably what you want here. I know you can’t use it directly, but it’s fundamentally just a wrapper – all the backend calls used here are used by it also.

I hope this helps & unblocks you, but if not, update here, and we’ll try to help.

Thanks for your help. I got the forward pass to work. I will work on the
backward pass later this afternoon. Several more questions:

  1. cudnn_graph Library — NVIDIA cuDNN Backend
    Setting Up An Engine Configuration
    has example code snippet 1 which sets gidx to 0 which is why I did that.
    Why does it need to be 1 instead of 0?
    Is there documentation that specifies what the possible values are
    and what they mean?
    If the docs are wrong, perhaps change them.

  2. cudnn_graph Library — NVIDIA cuDNN Backend
    CUDNN_ATTR_OPERATION_NORM_BWD_MEAN_DESC
    and
    CUDNN_ATTR_OPERATION_NORM_BWD_INV_VARIANCE_DESC
    both say that they are Optional attributes. This is why I did not
    include them. In the old API for batch normalization
    (cudnnBatchNormalizationBackward) these were optional and just
    made things faster (at the expense of some more memory). Are these
    required for layer normalization with the graph API? If so,
    perhaps change the docs.

Thanks,
Jeff

Thanks. I have several more questions. All my questions refer to the
documentation at:

cudnn_graph Library — NVIDIA cuDNN Backend

  1. The docs say CUDNN_HEUR_MODE_A and CUDNN_HEUR_MODE_B each support
    a specific list of operation nodes or graphs. Neither of them lists
    layer normalization. You suggested that I use the heuristic modes.
    What do I do?

  2. cudnnBackendGetAttribute of CUDNN_ATTR_ENGINEHEUR_RESULTS on a
    CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR returns a
    CUDNN_BACKEND_ENGINECFG_DESCRIPTOR. Do I first have to create the
    CUDNN_BACKEND_ENGINECFG_DESCRIPTOR with cudnnBackendCreateDescriptor,
    or does the CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR do that for me? I.e.
    after doing:

  CUDNN_CHECK(cudnnBackendCreateDescriptor
    (CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR, &engheur));
  CUDNN_CHECK(cudnnBackendSetAttribute
    (engheur, CUDNN_ATTR_ENGINEHEUR_OPERATION_GRAPH,
     CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &op_graph));
  CUDNN_CHECK(cudnnBackendSetAttribute
    (engheur, CUDNN_ATTR_ENGINEHEUR_MODE,
     CUDNN_TYPE_HEUR_MODE, 1, &heur_mode));
  CUDNN_CHECK(cudnnBackendFinalize(engheur));

should I do:

 CUDNN_CHECK(cudnnBackendCreateDescriptor
    (CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engcfg));

before I do:

  CUDNN_CHECK(cudnnBackendGetAttribute
    (engheur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
   CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, NULL, &engcfg));
  CUDNN_CHECK(cudnnBackendSetAttribute
    (engcfg, CUDNN_ATTR_ENGINECFG_ENGINE,
   CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engine));
  CUDNN_CHECK(cudnnBackendFinalize(engcfg));

Do I have to? The docs don’t say.

  3. Should I do:
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(engcfg));

on the CUDNN_BACKEND_ENGINECFG_DESCRIPTOR returned by
cudnnBackendGetAttribute of CUDNN_ATTR_ENGINEHEUR_RESULTS on a
CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR? Do I have to? The docs don’t
say.

  4. If I run the enclosed graph-api-example-v2 I get:
  before op_graph

  W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: !patternMatchOptionalLinearDAG(this->getEntranceNodes().front(), userGraph->getEntranceNodes().front())
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: referenceGraph->patternMatchLinearDAG(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: LinearPatternMatcher::matchPattern(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: matchPattern(opg.getGraph())
  w! Time: 2025-05-02T07:47:53.837436 (0d+0h+0m+3s since start)
  w! Process=1363821; Thread=1363821; GPU=NULL; Handle=NULL; StreamId=NULL.

  after op_graph
  cudnn_ln success
  before op_graph

  W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: !patternMatchOptionalLinearDAG(this->getEntranceNodes().front(), userGraph->getEntranceNodes().front())
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: referenceGraph->patternMatchLinearDAG(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: LinearPatternMatcher::matchPattern(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: matchPattern(opg.getGraph())
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: Non-pointwise ops before the forward norm op are not supported at: node_ptr->opType != CUDNN_BACKEND_OPERATION_POINTWISE_DESCRIPTOR
  w! Time: 2025-05-02T07:47:53.987149 (0d+0h+0m+3s since start)
  w! Process=1363821; Thread=1363821; GPU=NULL; Handle=NULL; StreamId=NULL.

  after op_graph
  cudnn_dlndx12b success

Why?

  5. If I run the enclosed graph-api-example-v3 I get:
  before op_graph

  W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: !patternMatchOptionalLinearDAG(this->getEntranceNodes().front(), userGraph->getEntranceNodes().front())
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: referenceGraph->patternMatchLinearDAG(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: LinearPatternMatcher::matchPattern(userGraph)
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED_GRAPH_PATTERN; Reason: matchPattern(opg.getGraph())
  w! Time: 2025-05-02T07:49:35.236680 (0d+0h+0m+4s since start)
  w! Process=1364076; Thread=1364076; GPU=NULL; Handle=NULL; StreamId=NULL.

  after op_graph
  before engine

  W! CuDNN (v90800 87) function cudnnBackendFinalize() called:
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: The specified engine with global index -1 is not found for NORM_FWD_TRAIN(18) at: CUDNN_INVALID_PLACE_HOLDER_ENGINE == name
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: finalize_internal()
  w!         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: ptrDesc->finalize()
  w! Time: 2025-05-02T07:49:35.236728 (0d+0h+0m+4s since start)
  w! Process=1364076; Thread=1364076; GPU=NULL; Handle=NULL; StreamId=NULL.

  after engine
  before engine heuristic
  after engine heuristic

  E! CuDNN (v90800 87) function cudnnBackendSetAttribute() called:
  e!         Error: CUDNN_STATUS_BAD_PARAM; Reason: ptrDesc->set(attributeName, attributeType, elementCount, arrayOfElements)
  e! Time: 2025-05-02T07:49:35.246351 (0d+0h+0m+4s since start)
  e! Process=1364076; Thread=1364076; GPU=NULL; Handle=NULL; StreamId=NULL.

  Failed: cuDNN error graph-api-example.c:355 'CUDNN_STATUS_BAD_PARAM'

Why?

  6. Thanks for the tip about getting the requisite workspace size and
    allocating it. I understand that. It is just that this is part of
    a big system that has a common mechanism elsewhere for determining
    how much workspace to allocate for the whole program and
    allocating that once at the beginning. So to simplify things, I
    didn’t include that in my minimal example.

  7. Re: “The overall API sequence of the c++ front-end wrapper is
    probably what you want here. I know you can’t use it directly, but
    it’s fundamentally just a wrapper – all the backend calls used
    here are used by it also.”

    I can’t determine from this code how to query the heuristics as
    per above as all it does is:

  REQUIRE(graph.create_execution_plans({fe::HeurMode_t::A}).is_good());

or

  REQUIRE(graph.create_execution_plans({fe::HeurMode_t::FALLBACK}).is_good());

Thanks,
Jeff
grap-api-example-v1.txt (20.2 KB)
grap-api-example-v2.txt (19.7 KB)
grap-api-example-v3.txt (21.2 KB)

Hi Jeff,

At least for the v3 code example, you needn’t set the engine on the engine-config (the engine-config already has the engine), and you don’t need to finalize the engine-configs returned from the heuristic (they should already be ready to use). Those two minor changes should unblock you (i.e., two API call deletions and it should work).
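The corrected sequence might look something like the sketch below. This is a hedged reconstruction against the variable names used earlier in the thread (`engheur`, `engcfg`, `plan`, `cudnn_handle`), not the actual fixed program: the engine config returned by the heuristic goes straight into the execution plan.

```c
/* Sketch: consume an engine config directly from the heuristic.
   Note the absence of cudnnBackendSetAttribute(engcfg,
   CUDNN_ATTR_ENGINECFG_ENGINE, ...) and cudnnBackendFinalize(engcfg):
   the heuristic result is already bound to an engine and finalized. */
int64_t returned = 0;
CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR, &engcfg));
CUDNN_CHECK(cudnnBackendGetAttribute(engheur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
    CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &returned, &engcfg));
CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR, &plan));
CUDNN_CHECK(cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
    CUDNN_TYPE_HANDLE, 1, &cudnn_handle));
CUDNN_CHECK(cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
    CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engcfg));
CUDNN_CHECK(cudnnBackendFinalize(plan));
```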

If the docs suggest that HEUR_MODE_A doesn’t support norm., then that’s wrong, and those will be corrected. For collecting heuristic results from _MODE_A, the descriptors need to be allocated, created, and once done with them, de-allocated (and if you are collecting many such results, contiguously allocated).

Let me know if you’d like to get v2 working, as I didn’t look at that.

– Seth

Thanks for your help.

I commented out the two API calls I think you are referring to in
graph-api-example-v4. Now i get the enclosed error. What do I need to
do to fix?

Is my call to cudnnBackendCreateDescriptor in
cudnn_engine_configuration_descriptor in the correct place,
i.e. before

  CUDNN_CHECK(cudnnBackendGetAttribute
	      (engheur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
	       CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, NULL, &engcfg));

Does it matter what order the descriptors are freed?

  CUDNN_CHECK(cudnnBackendDestroyDescriptor(varpack));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(plan));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(engcfg));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(engheur));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(engine));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(op_graph));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(fprop));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(yDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(epsilonDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(betaDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(gammaDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(invVarianceDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(meanDesc));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(xDesc));

graph-api-example-v4.txt (21.2 KB)
graph-api-example-v4.-log.txt (11.8 KB)

Thanks,
Jeff

The “CUDNN_STATUS_NOT_INITIALIZED” error is due to the _MODE_A heuristics result not supporting the given workload (and it tells you that because the engine-config is not finalized in that case; a slight correction to what I stated above). This can happen, and when it does for the full set of requested results (you can request more than 1), you’ll need to use the _FALLBACK mode to collect engine configurations. Not all workloads are supported by the faster engine configurations (we try, but there are edge cases), and when that happens, the _FALLBACK configurations are the next alternative.

For example, changing your program to a larger tensor worked with _MODE_A. If you can share a workload, I can set the dimensions / strides to correspond to that if needed.

For the last two questions:

  1. it looked ok; bear in mind that it may be better to poll for more than one heuristic result.
  2. destroying in reverse order as creation should be safe.

Thanks,

Three more questions:

  1. Is there a way to know the maximum number of engine configurations that can be returned by a call to cudnnBackendGetAttribute with CUDNN_ATTR_ENGINEHEUR_RESULTS, so that I can allocate enough space to get them all? Is there some MAX_ENGINEHEUR_RESULTS constant defined somewhere?
  2. Is there a way to tell if an engine configuration that is returned by a call to cudnnBackendGetAttribute with CUDNN_ATTR_ENGINEHEUR_RESULTS is finalized?
  3. I presume that the way to select the best one is to go down the array and pick the first one that is finalized. Is that correct?

It would be great if all of this were documented.

OK. I think I understand things better now.

The key is to call

  CUDNN_CHECK(cudnnBackendGetAttribute
	      (engheur,
	       CUDNN_ATTR_ENGINEHEUR_RESULTS,
	       CUDNN_TYPE_BACKEND_DESCRIPTOR,
	       MAX_NUMBER_OF_ENGINE_CONFIGURATIONS,
	       &number_of_engine_configurations,
	       &engcfg));

with &number_of_engine_configurations instead of NULL to get the
number of engine configurations returned. If it is zero, then there
are none with that heuristic. If it is nonzero, all returned engine
configuration will be finalized. So I don’t have to check whether they
are finalized.

Is my understanding correct?

Is there ever any need to request more than 1, i.e. to set
MAX_NUMBER_OF_ENGINE_CONFIGURATIONS to a number greater than 1?
If so, how do I choose among them?

Your understanding is correct.

Regarding what’s returned by the heuristic: while it is reliable, we encourage users to autotune to find the best-performing engine config for their workload. The norm engines are runtime compiled (per their behavior note), so once you have an engine config you are happy with, you should keep the corresponding plan alive until that workload is retired, rather than putting the full API flow (as above) in some critical path (because the runtime kernel compilation might be slow).
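Requesting several results and taking the first one that yields a finalized plan can be sketched as below. This is a hedged example using the thread's variable names (`engheur`, `plan`, `cudnn_handle`) plus a hypothetical MAX_CFGS; each engcfgs[i] must be created before the query, and a proper autotuner would time each candidate plan rather than stop at the first.

```c
/* Sketch: poll the heuristic for up to MAX_CFGS configs, then keep the
   first config whose execution plan finalizes successfully. */
#define MAX_CFGS 8
cudnnBackendDescriptor_t engcfgs[MAX_CFGS];
int64_t returned = 0;
for (int i = 0; i < MAX_CFGS; i++)
  CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_ENGINECFG_DESCRIPTOR,
                                           &engcfgs[i]));
CUDNN_CHECK(cudnnBackendGetAttribute(engheur, CUDNN_ATTR_ENGINEHEUR_RESULTS,
    CUDNN_TYPE_BACKEND_DESCRIPTOR, MAX_CFGS, &returned, engcfgs));
for (int64_t i = 0; i < returned; i++) {
  CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR,
                                           &plan));
  CUDNN_CHECK(cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_HANDLE,
      CUDNN_TYPE_HANDLE, 1, &cudnn_handle));
  CUDNN_CHECK(cudnnBackendSetAttribute(plan, CUDNN_ATTR_EXECUTION_PLAN_ENGINE_CONFIG,
      CUDNN_TYPE_BACKEND_DESCRIPTOR, 1, &engcfgs[i]));
  if (cudnnBackendFinalize(plan) == CUDNN_STATUS_SUCCESS) break;  /* candidate works */
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(plan));               /* try the next one */
}
```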

Now I have tried to put in actual values and I get a segfault in

CUDNN_CHECK(cudnnBackendExecute(cudnn_handle, plan, varpack));

(gdb shows that it is in that call.)

If I remove that, it runs to completion.

How do I fix?

graph-api-example-v8.txt (27.4 KB)

For epsilon, since it’s set by value, it should be a host pointer (not a GPU one). I can get _v8 to run successfully (on an Ampere arch. card) with a tiny tweak:

void *dev_ptrs[7] = {
    inputs_gpu,
    mean_gpu,
    inv_variance_gpu,
    gamma_gpu,
    beta_gpu,
    epsilon,      /* host pointer, since epsilon is passed by value */
    outputs_gpu
};

This corrected the segfault on my end, and the program terminated normally.

I have cuDNN layer normalization working in my code. I have also constructed a plan cache so I can reuse plans. I have two questions about this. Please see the enclosed code snippet.

  1. When I save the plan, can I destroy all of the other descriptors? Or does the plan reference these descriptors? See the comment “Can I do these here?” in the code snippet.
  2. Does creating the varpack take significant time? Should I cache that too? If I do, and I cache the dev_ptrs array, can I simply assign new pointers to that array and then just call cudnnBackendExecute? See the comment “Should I also cache the varpack?”. Can I change the FALSE to TRUE?

Also, I point out that the graph API for cuDNN is misdesigned. If you have a network API (i.e. torch.nn) you can store the plan on vertices of the network. But if you have a functional API (i.e. torch.nn.functional) there is no place to store the plan. So you need a cache like I have that is indexed by the sizes of the arguments. That could make plan cache lookup slow.

This whole thing would have been simpler with the legacy API had you exposed an ability to just call the predefined kernels without having to construct a graph and do graph matching.

snippet.txt (4.9 KB)

Hi Jeff,

  1. I see that in the code snippet you are destroying the plan descriptor after caching it. Note that a cudnnBackendDescriptor_t is a typedef’d void pointer to an opaque descriptor structure, and by caching the plan descriptor you are doing a shallow copy of the underlying structure. Hence, you should definitely not destroy the plan descriptor. Among the other descriptors, you should be okay to destroy all of them besides the op_graph descriptor.

  2. Creating the varPack should not take significant time. Assigning new pointers to the dev_ptrs array will not propagate them to the varPack; you will need to set the CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS attribute for that purpose. However, setting an attribute on a finalized varPack descriptor is not permitted, so you would have to create and finalize a new varPack descriptor every time before calling cudnnBackendExecute(). In other words, you cannot cache the varPack.
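A minimal sketch of what that implies for a plan cache follows. This is a hypothetical helper, not code from the thread: the cached, finalized plan is reused across calls, while a fresh variant pack is built per call from the current device pointers.

```c
/* Execute a cached plan with the current data pointers.
   uids must match the CUDNN_ATTR_TENSOR_UNIQUE_IDs used when the
   graph (and hence the plan) was built; n is the tensor count. */
static void execute_cached_plan(cudnnHandle_t handle,
                                cudnnBackendDescriptor_t plan,
                                int64_t *uids, void **ptrs, int64_t n,
                                void *workspace) {
  cudnnBackendDescriptor_t varpack;
  CUDNN_CHECK(cudnnBackendCreateDescriptor(CUDNN_BACKEND_VARIANT_PACK_DESCRIPTOR,
                                           &varpack));
  CUDNN_CHECK(cudnnBackendSetAttribute(varpack, CUDNN_ATTR_VARIANT_PACK_UNIQUE_IDS,
      CUDNN_TYPE_INT64, n, uids));
  CUDNN_CHECK(cudnnBackendSetAttribute(varpack, CUDNN_ATTR_VARIANT_PACK_DATA_POINTERS,
      CUDNN_TYPE_VOID_PTR, n, ptrs));
  CUDNN_CHECK(cudnnBackendSetAttribute(varpack, CUDNN_ATTR_VARIANT_PACK_WORKSPACE,
      CUDNN_TYPE_VOID_PTR, 1, &workspace));
  CUDNN_CHECK(cudnnBackendFinalize(varpack));
  CUDNN_CHECK(cudnnBackendExecute(handle, plan, varpack));
  CUDNN_CHECK(cudnnBackendDestroyDescriptor(varpack));  /* cheap; not cached */
}
```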

~ Rajas