Reduce-max is much slower than PyTorch in TensorRT

import time
import torch

x = torch.rand(1, 64, 6032, 60).float().cuda()
def test_max(x):
    return x.max(3, keepdim=True)[0]
test_max(x); torch.cuda.synchronize()  # warm up and sync before timing
t = time.time()
res = test_max(x)
torch.cuda.synchronize()  # wait for the kernel so the measured time is accurate
print(time.time() - t)

The code above takes 1.4 ms on a GTX 1060, but in TensorRT the same reduction takes 3.55 ms (measured with the profiler):

"(Unnamed Layer* 0) [Reduce]": 3.5529279708862305


  • CUDA 10.0
  • pytorch 1.1
  • TensorRT
  • Ubuntu 18.04
  • GTX1060 notebook

I don’t understand what you mean by executing your sample code in TensorRT. Can you please clarify?

I execute the reduce-max operation through the TensorRT network API with the same input and parameters, then use the profiler to measure the layer time:

layer = net.add_reduce(inp, trt.ReduceOperation.MAX, axis_trt, keep_dims=True)
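For reference, `add_reduce` takes the reduction axes as a bitmask rather than a single dimension index. A small helper (hypothetical name, assuming an implicit-batch network, where TensorRT tensor dimension 0 is the first dimension *after* the PyTorch batch dimension) to build that bitmask:

```python
def torch_dim_to_trt_axes(dim: int) -> int:
    """Convert a PyTorch dim (counting the batch dim) to a TensorRT
    reduce-axes bitmask for an implicit-batch network."""
    assert dim >= 1, "the batch dimension cannot be reduced in implicit-batch mode"
    return 1 << (dim - 1)

# For the (1, 64, 6032, 60) tensor above, reducing PyTorch dim 3
# maps to bit 2 of the TensorRT dims (64, 6032, 60):
axis_trt = torch_dim_to_trt_axes(3)
print(axis_trt)  # 4, i.e. binary 100
```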

I have found that implementing this kind of reduce-max with max pooling runs faster (but it is still slower than PyTorch).
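The equivalence this workaround relies on: a max reduction over the last dimension is the same as 2-D max pooling with a kernel that spans that entire dimension. A quick PyTorch check (on a smaller tensor so it runs on CPU):

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 64, 128, 60)

reduced = x.max(3, keepdim=True)[0]                    # (1, 64, 128, 1)
pooled = F.max_pool2d(x, kernel_size=(1, x.shape[3]))  # kernel spans dim 3

assert torch.equal(reduced, pooled)
print(pooled.shape)  # torch.Size([1, 64, 128, 1])
```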

Another question: is there any plan to support a simple reshape (one that just changes the shape/stride of a trt.ITensor, like torch.view, without adding a shuffle layer)? I use the network API, and the profiler shows that reshape layers can cost noticeable time (5%~10% in ShuffleNetV2).
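For context on why a copy-free reshape is attractive: torch.view only rewrites the shape/stride metadata and shares the underlying storage, so it costs essentially nothing at runtime, whereas a shuffle layer is a real layer in the engine. A small illustration:

```python
import torch

x = torch.rand(4, 6)
y = x.view(2, 12)  # same storage, new shape/stride metadata

assert y.data_ptr() == x.data_ptr()  # no data was copied
print(y.shape, y.stride())
```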


Engineering is working on a fix for this; in the meantime, they recommend the following:

You can add a unary transform that goes from reduce kMAX to max pooling; there is already a unary transform that goes from reduce kAVG to avg pooling.

I’m hitting the same issue. Could you share which TensorRT version carries this fix?