ReLU Bottleneck on Jetson TX2 w/ TensorRT

When we run our network using TensorRT on Jetson TX2 (JetPack 3.2 Production), we hit a fairly significant bottleneck in a ReLU kernel, trt_maxwell_scudnn_128x64_relu_small_nn_v1, which takes 86% of the execution time (measured with nvprof). We are a little surprised to see a Maxwell kernel, since the TX2 GPU is Pascal, so I'm wondering if something is causing what appears to be an unoptimized kernel for the TX2 to be selected. How would you advise speeding this up? The TRT CUDA engine was built on the TX2, following the workflow proposed here:

Interestingly, we see zero speedup and the same amount of memory allocated when switching to float16, which leads us to think either we are missing something or this ReLU kernel does not support float16. FP16 support is enabled via "builder->setHalf2Mode(true);" when we create the engine. We suspect the Maxwell kernel is the culprit, since it appears to be selected regardless of which data type we choose when building the engine.

Is there a way to game the system to choose a different RELU kernel?
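For reference, this is roughly how we build the engine. One detail worth flagging: with the TensorRT 3.x Caffe parser, setHalf2Mode alone is not enough for FP16; the weights must also be parsed as kHALF, which could explain the identical memory footprint. This is a minimal sketch with placeholder file names, output tensor name, and sizes; error handling is omitted.

```cpp
// Hedged sketch of FP16 engine creation with TensorRT 3.x (JetPack 3.2).
// "deploy.prototxt", "model.caffemodel", and the output name "prob" are
// placeholders, not our actual files.
#include "NvInfer.h"
#include "NvCaffeParser.h"

using namespace nvinfer1;
using namespace nvcaffeparser1;

ICudaEngine* buildFp16Engine(ILogger& logger)
{
    IBuilder* builder = createInferBuilder(logger);
    INetworkDefinition* network = builder->createNetwork();
    ICaffeParser* parser = createCaffeParser();

    // Key point: request kHALF weights at parse time as well --
    // enabling half2 mode alone leaves the parsed weights in fp32.
    const IBlobNameToTensor* blobs = parser->parse(
        "deploy.prototxt", "model.caffemodel", *network, DataType::kHALF);
    network->markOutput(*blobs->find("prob"));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(16 << 20);

    // Only request fp16 kernels if the platform supports fast fp16.
    if (builder->platformHasFastFp16())
        builder->setHalf2Mode(true);

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    network->destroy();
    parser->destroy();
    builder->destroy();
    return engine;
}
```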


We want to reproduce this issue internally.
Could you share your nvprof and model file with us?



TensorRT optimizes the network based on the GPU architecture.
One technique is to fuse layers into a single kernel where possible.
Fusion of convolution/bias/ReLU is commonly applied in TensorRT optimization.

As an experiment, remove the ReLU op and you will still see a similar profiling result.
Check this blog to learn more about our optimization techniques:

That is, although the kernel name contains "relu" (it is named after the last fused layer), it is actually a fused combination of the convolution, bias, and ReLU layers.
For a convolution-based network, it is normal for convolution to take most of the execution time.



That makes sense, and I guess I should have realized it. Thank you very much for the answer.