I have found that using max pooling to implement this kind of reduce-max runs faster (though still slower than PyTorch).
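For anyone wondering why the substitution works: a max-pool window that covers the full spatial extent computes exactly the same values as a reduce-max over those axes. A minimal NumPy sketch (not TensorRT code; the pooling helper here is hypothetical, written just for illustration):

```python
import numpy as np

def max_pool2d(x, kh, kw):
    # Naive max pooling with stride 1 over an NCHW tensor.
    n, c, h, w = x.shape
    out = np.empty((n, c, h - kh + 1, w - kw + 1), dtype=x.dtype)
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            out[:, :, i, j] = x[:, :, i:i + kh, j:j + kw].max(axis=(2, 3))
    return out

x = np.random.rand(2, 3, 8, 8).astype(np.float32)

# A pooling window equal to the full H x W plane is equivalent
# to a reduce-max over the spatial axes (keepdims=True).
pooled = max_pool2d(x, 8, 8)
reduced = x.max(axis=(2, 3), keepdims=True)
assert np.array_equal(pooled, reduced)
```

In TensorRT terms, this is the equivalence that lets a `kMAX` reduce over spatial axes be replaced by a max-pooling layer with the window set to the input's spatial dimensions.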
Another question: is there any plan to support a simple reshape (one that just changes the shape/strides of a trt.ITensor, like torch.view, without adding a shuffle layer)? I use the network API, and in the profiler the reshape layer can cost a noticeable amount of time (5%–10% in ShuffleNetV2).
Engineering is working on a fix for this; in the meantime, they recommend the following:
you can add a unary transform that converts reduce kMAX into max pooling.
There is already a unary transform that converts reduce kAVG into average pooling.