TensorRT Winograd performance on TX-2

I just profiled my RefineDet-based network on a TX2 and found that the trtwell_fp16x2_hcudnn_winograd_fp16x2_128x128_ldg1_ldg4_relu_tile148m_nt kernel's occupancy is listed as 25%, limited by register usage (128 registers per thread). Since this kernel is the main convolution algorithm for most CNN networks, it is invoked many times, so it seems like a good candidate for further optimization. Has NVIDIA evaluated the tradeoff of reducing its register usage for the TX2 (sm_62)?
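For what it's worth, the 25% figure is consistent with a register-pressure limit alone. A back-of-the-envelope check, assuming the CUDA occupancy-calculator limits for compute capability 6.2 (65536 32-bit registers per SM, 2048 resident threads per SM) and ignoring block-size and register-allocation granularity:

```python
# Assumed sm_62 (TX2, Pascal) per-SM limits from the CUDA occupancy calculator.
REGS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048

def occupancy_limit_from_registers(regs_per_thread: int) -> float:
    """Fraction of max resident threads permitted by register pressure alone.

    Ignores block-size quantization and register allocation granularity,
    so this is an upper bound on achievable occupancy, not an exact value.
    """
    threads_allowed = REGS_PER_SM // regs_per_thread
    return min(threads_allowed, MAX_THREADS_PER_SM) / MAX_THREADS_PER_SM

print(occupancy_limit_from_registers(128))  # 65536 // 128 = 512 threads -> 0.25
```

So at 128 registers per thread the register file caps the SM at 512 resident threads, i.e. 25% occupancy, matching what the profiler reports.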

Does anyone know of upcoming updates to these algorithms in the cuDNN library for the TX2 (aarch64/ARM)?