CUDNN_STATUS_NOT_SUPPORTED when calling cudnnOpTensor()

Happy new year!
I’ve read the API documentation carefully, and it says this error can be caused by two things:

1. The dimensions of the bias tensor and the output tensor are above 5.

My bias tensor is 1D, and the output tensor is 3D.

2. opTensorCompType is not set as stated above.

In my code, op = FLOAT, A = FLOAT, B = FLOAT, and C = FLOAT, which strictly matches the first row of the supported data types table in the cuDNN API Reference.

My code is as follows. Please help me, thanks!


dims[0] = N;
dims[1] = N;
dims[2] = 1;
strides[0] = dims[2] * dims[1];
strides[1] = dims[2];
strides[2] = 1;
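The stride setup above is the standard packed row-major layout. As a sanity check, here is a small host-side helper (hypothetical, not part of the original post) that generalizes it to any rank:

```cpp
#include <cassert>

// Compute packed (contiguous, row-major) strides for an n-dimensional
// shape: the innermost stride is 1, and each outer stride is the
// product of all inner dimensions.
void packed_strides(const int* dims, int* strides, int n) {
    int s = 1;
    for (int i = n - 1; i >= 0; --i) {
        strides[i] = s;
        s *= dims[i];
    }
}
```

For dims = {N, N, 1} this produces strides = {N, 1, 1}, exactly what the code above computes by hand.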

cudnnHandle_t cudnn;
CUDNN_CALL(cudnnCreate(&cudnn));

// initialize descriptor
cudnnTensorDescriptor_t aDesc;
cudnnTensorDescriptor_t bDesc;
cudnnTensorDescriptor_t cDesc;
cudnnOpTensorDescriptor_t opDesc;
CUDNN_CALL(cudnnCreateTensorDescriptor(&aDesc));
CUDNN_CALL(cudnnCreateTensorDescriptor(&bDesc));
CUDNN_CALL(cudnnCreateTensorDescriptor(&cDesc));
CUDNN_CALL(cudnnCreateOpTensorDescriptor(&opDesc));
  
CUDNN_CALL(
    cudnnSetTensorNdDescriptor(aDesc, CUDNN_DATA_FLOAT, 3, dims, strides));
CUDNN_CALL(
    cudnnSetTensorNdDescriptor(bDesc, CUDNN_DATA_FLOAT, 3, dims, strides));
CUDNN_CALL(
    cudnnSetTensorNdDescriptor(cDesc, CUDNN_DATA_FLOAT, 3, dims, strides));
CUDNN_CALL(
    cudnnSetOpTensorDescriptor(opDesc, CUDNN_OP_TENSOR_ADD, CUDNN_DATA_FLOAT,
                               CUDNN_NOT_PROPAGATE_NAN));
      
const float alpha = 1.0f;
const float beta = 0.0f;

// allocate memory
auto a = allocate<float>(dims[0] * dims[1] * dims[2] * sizeof(float));
auto b = allocate<float>(dims[0] * dims[1] * dims[2] * sizeof(float));
auto c = allocate<float>(dims[0] * dims[1] * dims[2] * sizeof(float));
// initialize with random data
curandGenerator_t gen;
CURAND_CALL(curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW));
CURAND_CALL(curandGenerateUniform(gen, a.get(), dims[0] * dims[1] * dims[2]));
CURAND_CALL(curandGenerateUniform(gen, b.get(), dims[0] * dims[1] * dims[2]));
CUDA_CALL(cudaDeviceSynchronize());

CUDNN_CALL(cudnnOpTensor(cudnn, opDesc, &alpha, aDesc, a.get(), &alpha, bDesc, b.get(), &beta, cDesc, c.get()));
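For reference, cudnnOpTensor computes C = op(alpha1 * A, alpha2 * B) + beta * C; with alpha = 1 and beta = 0 as above, the ADD op reduces to a plain elementwise sum. A CPU sketch of that semantics (assuming identical packed shapes, no broadcasting):

```cpp
#include <cstddef>
#include <vector>

// CPU reference for cudnnOpTensor with CUDNN_OP_TENSOR_ADD:
//   C = alpha1 * A + alpha2 * B + beta * C, elementwise.
std::vector<float> op_tensor_add(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 std::vector<float> c,
                                 float alpha1, float alpha2, float beta) {
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = alpha1 * a[i] + alpha2 * b[i] + beta * c[i];
    return c;
}
```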

Supplement:

CUXX_CALL is a wrapper macro for error checking, and allocate returns a shared_ptr pointing to memory created by cudaMalloc.

I’ve solved this problem; it seems 3D tensors are simply not supported here.
When I padded the data to 4D (with an extra dimension of size 1), it worked successfully.
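For anyone hitting the same error, the fix can be sketched as prepending a size-1 dimension to the shape and stride arrays before setting the descriptors. This is a hypothetical host-side helper, not part of cuDNN itself:

```cpp
#include <cassert>

// Pad a 3D shape/stride pair to the 4D layout cuDNN accepts, by
// prepending a leading dimension of size 1. The data pointer and
// memory layout are unchanged; only the descriptor metadata differs.
void pad_to_4d(const int dims3[3], const int strides3[3],
               int dims4[4], int strides4[4]) {
    dims4[0] = 1;
    strides4[0] = dims3[0] * strides3[0];  // stride of the new outer dim
    for (int i = 0; i < 3; ++i) {
        dims4[i + 1] = dims3[i];
        strides4[i + 1] = strides3[i];
    }
}
// Then set the descriptors with nbDims = 4, e.g.:
// cudnnSetTensorNdDescriptor(aDesc, CUDNN_DATA_FLOAT, 4, dims4, strides4);
```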