Cannot create engine for a model with SoftmaxLayer

Description

When building an engine from a network definition using the C++ API, with the network consisting only of a 5D input and a softmax layer, the engine can only be created when the input shape is at most (1, 3, 248, 248, 248). I have attached a minimal example that reproduces the issue. The error I get is: “Error Code 10: Internal Error (Could not find any implementation for node softmax.)”.
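
For reference, a minimal sketch of what such a reproducer can look like (the attached softmax_bug.cpp is the authoritative version; the logger class, tensor names, and the failing shape (1, 3, 256, 256, 256) below are illustrative):

```cpp
#include <NvInfer.h>
#include <iostream>
#include <memory>

// Minimal logger; forwards all builder messages so the verbose output is visible.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kVERBOSE)
            std::cout << msg << std::endl;
    }
};

int main() {
    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));

    // 5D input; shapes up to (1, 3, 248, 248, 248) build fine, larger ones do not.
    auto* input = network->addInput("input", nvinfer1::DataType::kFLOAT,
                                    nvinfer1::Dims{5, {1, 3, 256, 256, 256}});

    auto* softmax = network->addSoftMax(*input);
    softmax->setName("softmax");
    softmax->setAxes(1U << 1);  // reduce over the channel axis
    softmax->getOutput(0)->setName("output");
    network->markOutput(*softmax->getOutput(0));

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    auto engine = std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));

    // With the large shape, the returned pointer is null and the builder logs
    // "Error Code 10: Internal Error (Could not find any implementation for node softmax.)"
    std::cout << (engine ? "build succeeded" : "build failed") << std::endl;
    return engine ? 0 : 1;
}
```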

Environment

**TensorRT Version**: 8.6.1
**GPU Type**: RTX 2000 Ada Laptop GPU / RTX 5000 A
**Nvidia Driver Version**: 536.25
**CUDA Version**: 11.8
**CUDNN Version**: 8.9
**Operating System + Version**: Windows 10 / 11
**Python Version (if applicable)**:
**TensorFlow Version (if applicable)**:
**PyTorch Version (if applicable)**:
**Baremetal or Container (if container which image + tag)**:

softmax_bug.cpp (1.6 KB)

Hi,

Could you please share the complete verbose logs with us?

Thank you.

I have also attached an updated example file that reproduces this log.

softmax_bug_verbose.cpp (1.6 KB)

Verbose logs:
[MemUsageChange] Init CUDA: CPU +346, GPU +0, now: CPU 18570, GPU 1381 (MiB)
Trying to load shared library nvinfer_builder_resource.dll
Loaded shared library nvinfer_builder_resource.dll
[MemUsageChange] Init builder kernel library: CPU +1422, GPU +264, now: CPU 21129, GPU 1645 (MiB)
Original: 1 layers
After dead-layer removal: 1 layers
Graph construction completed in 0.0008457 seconds.
After Myelin optimization: 1 layers
Applying ScaleNodes fusions.
After scale fusion: 1 layers
After dupe layer removal: 1 layers
After final dead-layer removal: 1 layers
After tensor merging: 1 layers
After vertical fusions: 1 layers
After dupe layer removal: 1 layers
After final dead-layer removal: 1 layers
After tensor merging: 1 layers
After slice removal: 1 layers
After concat removal: 1 layers
Trying to split Reshape and strided tensor
Graph optimization time: 0.0025031 seconds.
Building graph using backend strategy 2
Local timing cache in use. Profiling results in this builder pass will not be stored.
Constructing optimization profile number 0 [1/1].
Applying generic optimizations to the graph for inference.
Reserving memory for host IO tensors. Host: 0 bytes
=============== Computing costs for softmax
*************** Autotuning format combination: Float(50331648,16777216,65536,256,1) → Float(50331648,16777216,65536,256,1) ***************
--------------- Timing Runner: softmax (CaskSoftMaxV2[0x80000040])
Skipping tactic 0x48c115a824ac468d due to exception shader run failed
softmax (CaskSoftMaxV2[0x80000040]) profiling completed in 0.0039532 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
10: Could not find any implementation for node softmax.
10: [optimizer.cpp::nvinfer1::builder::cgraph::LeafCNode::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node softmax.)

Any news on this topic?

Hello,

We have re-tested this with CUDA 12.3 & TensorRT 8.6.1.6 on Linux, and it now works for input shapes [1, 3, X, X, X] with X ∈ [0, 255]. It still fails for X > 255, so it remains impossible to build the engine with spatial dimension 256.
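
For what it is worth, the boundary lines up with the stride combination in the verbose log above: at X = 256 the total element count is exactly the 50331648 appearing in the Float(50331648,16777216,65536,256,1) line, and the per-channel spatial volume hits 2^24. A small illustrative check (plain arithmetic, not a confirmed root cause):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Element counts for shape (1, 3, X, X, X) at the observed boundary.
    // At X = 256 the spatial volume per channel reaches exactly 2^24, and the
    // total count matches the Float(50331648,...) format combination logged above.
    for (int64_t x : {255, 256}) {
        const int64_t spatial = x * x * x;   // X^3 elements per channel
        const int64_t total = 3 * spatial;   // full tensor: 1 * 3 * X^3
        std::cout << "X = " << x
                  << ": spatial = " << spatial
                  << ", total = " << total
                  << ", ~" << (total * 4) / (1024 * 1024) << " MiB (FP32)\n";
    }
    std::cout << "2^24 = " << (int64_t{1} << 24) << std::endl;
}
```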

Is there any update on this issue?

Thank you.