When we run our network with TensorRT on a Jetson TX2 (JetPack 3.2 Production), we hit a significant bottleneck in a ReLU kernel, trt_maxwell_scudnn_128x64_relu_small_nn_v1, which accounts for 86% of the execution time (measured with nvprof; invocation below). We are surprised to see a Maxwell kernel, since the TX2's integrated GPU is Pascal-based, so we wonder whether something is causing what appears to be an unoptimized kernel for the TX2 to be selected. Is there a way you would advise to speed this up? The TensorRT CUDA engine was built on the TX2, following the workflow proposed here:
https://devtalk.nvidia.com/default/topic/1030508/jetson-tx2/jetson-tx2-tensorflow-tensorrt-workflow/.
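For context, the timing above comes from running our inference application under plain nvprof on the TX2 (sudo nvprof ./trt_inference_app, where the binary name is a placeholder for our app); nvprof's default summary reports per-kernel time percentages, which is where the 86% figure for trt_maxwell_scudnn_128x64_relu_small_nn_v1 comes from.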
Interestingly, we see zero speedup and the same amount of memory allocated when switching to FP16, which leads us to think either we are missing something or this ReLU kernel does not support FP16. FP16 is enabled via "builder->setHalf2Mode(true);" when we create the engine. We suspect the Maxwell kernel is the culprit, since it appears to be selected regardless of which data type we choose when building the engine.
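For reference, here is a minimal sketch of how we build the engine, following the TensorRT 3.x UFF workflow from the thread above; the model file, input/output names, and dimensions are placeholders for ours. One thing we are unsure about is whether the weights also need to be parsed as kHALF, as shown here, for half2 mode to have any effect:

#include "NvInfer.h"
#include "NvUffParser.h"
#include <iostream>

using namespace nvinfer1;
using namespace nvuffparser;

// Minimal logger required by the TensorRT builder.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

ICudaEngine* buildFp16Engine()
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    IUffParser* parser = createUffParser();

    // Placeholder tensor names and dimensions -- ours differ.
    parser->registerInput("input", DimsCHW(3, 224, 224));
    parser->registerOutput("output");

    // Parsing with kHALF stores the weights as FP16. We currently parse
    // with kFLOAT -- could that be why half2 mode changes nothing?
    parser->parse("model.uff", *network, DataType::kHALF);

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 24);

    // Guard half2 mode on native FP16 support (should be true on the
    // TX2's Pascal GPU).
    if (builder->platformHasFastFp16())
        builder->setHalf2Mode(true);

    ICudaEngine* engine = builder->buildCudaEngine(*network);

    parser->destroy();
    network->destroy();
    builder->destroy();
    return engine;
}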
Is there a way to game the system to choose a different ReLU kernel?