Transferring ONNX Softmax operation to TensorRT

Hi.
I have been testing the TensorRT Softmax operation converted from an ONNX model.
I created a single-layer Softmax with a (3, 4, 5) input/output shape using the code below.
However, TensorRT launched via trtexec reports the output as shape (1, 4, 5), while I would expect the output shape of Softmax to match the input shape.
Could you please tell me what is wrong with my test?

import onnx
import onnx.helper as oh
from onnx import checker

out_path = "softmax_test.onnx"


def main():
    # Declare a (3, 4, 5) input and a matching (3, 4, 5) output
    in_tensor = [
        oh.make_tensor_value_info("Input", onnx.TensorProto.FLOAT, [3, 4, 5]),
    ]

    out_tensor = [
        oh.make_tensor_value_info("Output", onnx.TensorProto.FLOAT, [3, 4, 5]),
    ]

    # A single Softmax node over axis=1
    nodes = [oh.make_node("Softmax", axis=1, inputs=["Input"], outputs=["Output"])]

    graph = oh.make_graph(nodes, "Test Graph", in_tensor, out_tensor)
    checker.check_graph(graph)

    model = oh.make_model(graph, producer_name="TFURU2", producer_version="0.1")
    checker.check_model(model)

    # Serialize the model, and also write a human-readable text dump
    with open(out_path, "wb") as f:
        f.write(model.SerializeToString())

    with open(out_path + ".txt", "w") as f:
        print(model, file=f)


if __name__ == "__main__":
    main()

Here is the trtexec output.

&&&& RUNNING TensorRT.trtexec # trtexec --onnx=softmax_test.onnx --verbose --dumpOutput --batch=1 --safe
[09/01/2019-08:44:04] [I] === Model Options ===
[09/01/2019-08:44:04] [I] Format: ONNX
[09/01/2019-08:44:04] [I] Model: softmax_test.onnx
[09/01/2019-08:44:04] [I] Output:
[09/01/2019-08:44:04] [I] === Build Options ===
[09/01/2019-08:44:04] [I] Max batch: 1
[09/01/2019-08:44:04] [I] Workspace: 16 MB
[09/01/2019-08:44:04] [I] minTiming: 1
[09/01/2019-08:44:04] [I] avgTiming: 8
[09/01/2019-08:44:04] [I] Precision: FP32
[09/01/2019-08:44:04] [I] Calibration: 
[09/01/2019-08:44:04] [I] Safe mode: Enabled
[09/01/2019-08:44:04] [I] Save engine: 
[09/01/2019-08:44:04] [I] Load engine: 
[09/01/2019-08:44:04] [I] Inputs format: fp32:CHW
[09/01/2019-08:44:04] [I] Outputs format: fp32:CHW
[09/01/2019-08:44:04] [I] Input build shapes: model
[09/01/2019-08:44:04] [I] === System Options ===
[09/01/2019-08:44:04] [I] Device: 0
[09/01/2019-08:44:04] [I] DLACore: 
[09/01/2019-08:44:04] [I] Plugins:
[09/01/2019-08:44:04] [I] === Inference Options ===
[09/01/2019-08:44:04] [I] Batch: 1
[09/01/2019-08:44:04] [I] Iterations: 10 (200 ms warm up)
[09/01/2019-08:44:04] [I] Duration: 10s
[09/01/2019-08:44:04] [I] Sleep time: 0ms
[09/01/2019-08:44:04] [I] Streams: 1
[09/01/2019-08:44:04] [I] Spin-wait: Disabled
[09/01/2019-08:44:04] [I] Multithreading: Enabled
[09/01/2019-08:44:04] [I] CUDA Graph: Disabled
[09/01/2019-08:44:04] [I] Skip inference: Disabled
[09/01/2019-08:44:04] [I] Input inference shapes: model
[09/01/2019-08:44:04] [I] === Reporting Options ===
[09/01/2019-08:44:04] [I] Verbose: Enabled
[09/01/2019-08:44:04] [I] Averages: 10 inferences
[09/01/2019-08:44:04] [I] Percentile: 99
[09/01/2019-08:44:04] [I] Dump output: Enabled
[09/01/2019-08:44:04] [I] Profile: Disabled
[09/01/2019-08:44:04] [I] Export timing to JSON file: 
[09/01/2019-08:44:04] [I] Export profile to JSON file: 
[09/01/2019-08:44:04] [I] 
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - GridAnchor_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - NMS_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - Reorg_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - Region_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - Clip_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - LReLU_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - PriorBox_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - Normalize_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - RPROI_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - BatchedNMS_TRT
[09/01/2019-08:44:04] [V] [TRT] Plugin Creator registration succeeded - FlattenConcat_TRT
----------------------------------------------------------------
Input filename:   softmax_test.onnx
ONNX IR version:  0.0.4
Opset version:    9
Producer name:    MACNICA
Producer version: 0.1
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
[09/01/2019-08:44:04] [V] [TRT] Output:Softmax -> (4, 5)
 ----- Parsing of ONNX model softmax_test.onnx is Done ---- 
[09/01/2019-08:44:04] [V] [TRT] Applying generic optimizations to the graph for inference.
[09/01/2019-08:44:04] [V] [TRT] Original: 1 layers
[09/01/2019-08:44:04] [V] [TRT] After dead-layer removal: 1 layers
[09/01/2019-08:44:04] [V] [TRT] After scale fusion: 1 layers
[09/01/2019-08:44:04] [V] [TRT] After vertical fusions: 1 layers
[09/01/2019-08:44:04] [V] [TRT] After final dead-layer removal: 1 layers
[09/01/2019-08:44:04] [V] [TRT] After tensor merging: 1 layers
[09/01/2019-08:44:04] [V] [TRT] After concat removal: 1 layers
[09/01/2019-08:44:04] [V] [TRT] Graph construction and optimization completed in 0.000163059 seconds.
[09/01/2019-08:44:06] [V] [TRT] Constructing optimization profile number 0 out of 1
*************** Autotuning format combination: Float(1,5,20) -> Float(1,5,20) ***************
[09/01/2019-08:44:06] [V] [TRT] --------------- Timing Runner: (Unnamed Layer* 0) [Softmax] (SoftMax)
[09/01/2019-08:44:06] [V] [TRT] Tactic: 1001 time 0.007168
[09/01/2019-08:44:06] [V] [TRT] Fastest Tactic: 1001 Time: 0.007168
[09/01/2019-08:44:06] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: SoftMax Tactic: 1001
[09/01/2019-08:44:06] [V] [TRT] 
[09/01/2019-08:44:06] [V] [TRT] Formats and tactics selection completed in 0.00245976 seconds.
[09/01/2019-08:44:06] [V] [TRT] After reformat layers: 1 layers
[09/01/2019-08:44:06] [V] [TRT] Block size 16777216
[09/01/2019-08:44:06] [V] [TRT] Total Activation Memory: 16777216
[09/01/2019-08:44:06] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[09/01/2019-08:44:06] [V] [TRT] Engine generation completed in 1.42676 seconds.
[09/01/2019-08:44:06] [V] [TRT] Engine Layer Information:
[09/01/2019-08:44:06] [V] [TRT] Layer: (Unnamed Layer* 0) [Softmax] (SoftMax), Tactic: 1001, Input[Float(4,5)] -> Output[Float(4,5)]
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0111616 ms (host walltime is 0.0464016 ms, 99% percentile time is 0.024576).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0091136 ms (host walltime is 0.0346638 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0093216 ms (host walltime is 0.0341824 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0095232 ms (host walltime is 0.0343413 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0091104 ms (host walltime is 0.0345789 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0091136 ms (host walltime is 0.0344361 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.009312 ms (host walltime is 0.0344415 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0091136 ms (host walltime is 0.0343255 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0090144 ms (host walltime is 0.0343258 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Average over 10 runs is 0.0090144 ms (host walltime is 0.0344853 ms, 99% percentile time is 0.011264).
[09/01/2019-08:44:06] [I] Dumping output tensor Output:
[09/01/2019-08:44:06] [I] [1, 4, 5]
[09/01/2019-08:44:06] [I] 0.25 0.25 0.25 0.25 0.25
[09/01/2019-08:44:06] [I] 0.25 0.25 0.25 0.25 0.25
[09/01/2019-08:44:06] [I] 0.25 0.25 0.25 0.25 0.25
[09/01/2019-08:44:06] [I] 0.25 0.25 0.25 0.25 0.25
&&&& PASSED TensorRT.trtexec # trtexec --onnx=softmax_test.onnx --verbose --dumpOutput --batch=1 --safe

Thanks.

I forgot to write my environment.
TensorRT version: NGC 19.09-py3
GPU: Quadro GV100

In short, I suspect TensorRT's handling of the ONNX Softmax operator.
The attached test case produces different results between versions <= 5.0 and versions >= 5.1.
TensorRT versions <= 5.0 appear to show the usual Softmax behavior for axis=1, but that does not match the ONNX Softmax definition:
https://github.com/onnx/onnx/blob/master/docs/Operators.md#Softmax

TensorRT versions >= 5.1 appear to behave strangely.
Could you please check the attached test case and tell me if anything is wrong?
Thanks.

trt_onnx_softmax_test_20191002.zip (7.15 KB)

Hi,

In TensorRT, axis=0 is used as the batch-size axis.
Since the batch size is set to 1, the network automatically expands the tensor dimensions to [1, …].

You can expand the dimensions to [1, 3, 4, 5] and reserve axis=0 for the batch size.
After changing in_tensor/out_tensor to [1, 3, 4, 5], we see the expected softmax result from TensorRT.

[I] Average over 10 runs is 0.145193 ms (host walltime is 0.212472 ms, 99% percentile time is 0.178646).
[I] Dumping output tensor Output:
[I] [1, 3, 4, 5]
[I] 0.333333 0.333333 0.333333 0.333333 0.333333
...
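For reference, the 0.333333 values above are consistent with a softmax over the size-3 axis when all inputs are equal (trtexec appears to use identical default input values). A minimal pure-Python sketch of that reduction:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a flat list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# For a [1, 3, 4, 5] tensor with identical inputs, softmax along axis=1
# normalizes over 3 equal values at each (h, w) position:
column = [1.0, 1.0, 1.0]       # the 3 values along axis=1
print(softmax(column))          # each entry is 1/3, matching 0.333333
```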

Thanks.

AastaLLL,
Thank you for your reply.
I confirmed that the [1, 3, 4, 5] case with axis=1 works as you said.
However, the [1, 1, 3, 4, 5] case with axis=1 did not work as I expected.
Since the second dimension has size 1, I think all the outputs should be 1.

[09/07/2019-06:41:06] [I] Dumping output tensor Output:
[09/07/2019-06:41:06] [I] [1, 1, 3, 4, 5]
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
[09/07/2019-06:41:06] [I] 0.333333 0.333333 0.333333 0.333333 0.333333
&&&& PASSED TensorRT.trtexec # trtexec --onnx=softmax_test.onnx --verbose --dumpOutput --batch=1 --safe
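Under a per-axis (NumPy-style) reading of axis=1, each softmax here reduces over a single element, so every output would be exactly 1, which is the expectation stated above. A pure-Python sketch:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a flat list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# axis=1 of a [1, 1, 3, 4, 5] tensor has size 1, so each softmax is
# taken over a single element and is trivially 1.0:
print(softmax([2.7]))   # → [1.0]
```

The dumped 0.333333 values match neither this reading nor the ONNX 2-D coercion of opset 9, which is what makes the result look suspicious.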

Please find the attached softmax_test.onnx, which is the test model.

Thanks.
log.txt (7.26 KB)
trt_onnx_softmax_test_20191007.zip (6.54 KB)
softmax_test.zip (292 Bytes)

Hi,

Thanks for your update.
We will check this issue with our internal team and share more information later.

Thanks.

AastaLLL,

Thank you for the support.
May I have an update on your analysis?

Thanks.

Hi,

Sorry for the late update.

We tried to run your sample in our environment but hit the following error:

Traceback (most recent call last):
  File "trt_onnx_softmax_test.py", line 67, in <module>
    test_model()
  File "trt_onnx_softmax_test.py", line 58, in test_model
    inputs=inputs, outputs=outputs, stream=stream)
  File "/home/nvidia/trt_onnx_softmax_test/common.py", line 143, in do_inference
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
  File "/home/nvidia/trt_onnx_softmax_test/common.py", line 143, in <listcomp>
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
pycuda._driver.LogicError: cuMemcpyHtoDAsync failed: invalid argument

Could you help check the issue in common.py?
Thanks.

Please use the docker container nvcr.io/nvidia/tensorrt:19.10-py3 on NGC

In the container, install the python dependencies with the following command:

$ /opt/tensorrt/python/python_setup.sh

Run the script (which I attached before) to generate the test ONNX model

$ python trt_onnx_softmax_test.py

Run trtexec for the generated softmax_test.onnx

$ trtexec --onnx=softmax_test.onnx --verbose --dumpOutput --batch=1 --safe

Thanks.

Hi,

This issue has been fed back to our internal team.
We will share more information once we have further findings.

Thanks.

Hi,

Sorry for the long delay. We now have some feedback from our internal team.

The suggestion is similar to before.
The first axis is treated as the batch size, so it is overwritten by the batch parameter passed when launching TensorRT.

So if you want to use [3, 4, 5] as the input tensor, please redefine it as [1, 3, 4, 5], which reserves the first axis for the batch size.

If you want to use the first axis as the batch size and still get the expected output, please run TensorRT with the corresponding batch size:

/usr/src/tensorrt/bin/trtexec --onnx=softmax_test.onnx --verbose --dumpOutput --batch=3 --safe

Thanks.

Thank you for your comment. But I want to know about the five-dimensional case, such as [1, 1, 3, 4, 5] with axis=1.
Do you mean that the maximum tensor rank in TensorRT is only four?

Hi,

Based on our spec, Softmax supports input/output dimensions from 1 to 7:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-601/tensorrt-support-matrix/index.html#layers-matrix

So the maximum number of dimensions should be 7.

Thanks.

Thank you for your reply.

So the shape [1, 1, 3, 4, 5] is supposed to be fine.
My experiment should then output all 1s, but the actual output was all 0.333333.
Is there possibly a TensorRT issue in the ONNX import?

Hi,

Sorry for keeping you waiting.

After checking with our internal team, this is a bug in TensorRT v6.0.
The fix is already in TensorRT v7.0.
Could you give it a try?

Thanks.

AastaLLL,

Thank you for the information.
I have tried TensorRT v7.0. Based on my experiments, I understand the behavior as follows. Is this correct?

TensorRT 7.0 - The ONNX Softmax to TensorRT Softmax conversion has been fixed to follow ONNX's Softmax specification.

TensorRT 6.0 - Interprets the ONNX Softmax axis the same way as the axis argument of numpy.sum(). The conversion is valid only if the input has <= 4 dimensions.

Thanks.

Hi,

Sorry for the late reply. The result should be 0.0167 with TensorRT v7.0.

According to the ONNX spec: https://github.com/onnx/onnx/blob/master/docs/Operators.md#Softmax
For a softmax over [1, 1, 3, 4, 5] with axis=1, the input is first reshaped to [1, 60], softmax is applied, and the result is reshaped back to [1, 1, 3, 4, 5].
Assuming all the inputs are equal, which should be what trtexec uses by default, the output values should all be 1/60, i.e. about 0.0167.
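The reshape-then-softmax semantics described above can be sketched in pure Python (the helper name `softmax_opset9` is ours for illustration, not an ONNX API; `math.prod` requires Python 3.8+):

```python
import math

def softmax_opset9(flat, shape, axis):
    """ONNX opset-9 Softmax: coerce the tensor to 2-D [outer, inner]
    at `axis`, softmax each row; the caller reshapes back afterwards."""
    outer = math.prod(shape[:axis])   # product of dims before `axis`
    inner = math.prod(shape[axis:])   # product of dims from `axis` on
    out = []
    for r in range(outer):
        row = flat[r * inner:(r + 1) * inner]
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.extend(e / s for e in exps)
    return out

shape = (1, 1, 3, 4, 5)
flat = [1.0] * 60                       # identical inputs
result = softmax_opset9(flat, shape, axis=1)
print(result[0])                        # 1/60, about 0.0166667
```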

Do you get the similar result with v7.0?
Thanks.

AastaLLL,

Thank you for your reply.
Yes, I got 0.0167 for all 60 output values when I set all input values to 1.0 via the --loadInputs option.

[01/31/2020-00:46:59] [I] Output Tensors:
[01/31/2020-00:46:59] [I] Output: (1x1x3x4x5)
[01/31/2020-00:46:59] [I] 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667 0.0166667
&&&& PASSED TensorRT.trtexec # trtexec --onnx=softmax_test.onnx --dumpOutput --batch=1 --safe --loadInputs=Input:input.txt

Please close this issue. Thank you.

Hi, could you please show us the data format of your input.txt? We can't find any documentation about the format…