TensorRT softmax layer very slow

I used a softmax layer in my prototxt and ran it with TensorRT.
It runs about 10x slower than Caffe's softmax layer.

TensorRT 3.0.1
cuDNN 7.0.5
Ubuntu 16.04
GTX 1070

layer {
  name: "mbox_conf_softmax"
  type: "Softmax"
  bottom: "mbox_conf_reshape"
  top: "mbox_conf_softmax"
  softmax_param {
    axis: 2
  }
}

Overall this layer takes more than 50% of the model's runtime, whereas in Caffe it hardly affects inference time.

How did you measure this?

I compared the following Caffe prototxt using caffe time and giexec:

name: "sample"
input: "data"
input_shape {
  dim: 1
  dim: 2436
  dim: 4
  dim: 1
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "data"
  top: "softmax"
  softmax_param {
    axis: 2
  }
}

./caffe time --model softmax.prototxt -gpu 0
shows “Average Forward pass: 0.0410502 ms.”

./giexec --deploy=softmax.prototxt --output=softmax
shows “Average over 10 runs is 0.958691 ms.”

I profiled giexec with nvvp and see a single kernel, cudnn::detail::softmax_fw_channel_4d_kernel, which takes 0.9 ms to finish.

I also noticed that the TensorRT softmax layer ignores the axis parameter.
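If the axis parameter really is ignored and the softmax is applied over the channel axis, one possible workaround (my assumption, not something I've verified inside TensorRT) is to permute the tensor so the axis you want normalized becomes the channel axis, apply a channel softmax, and permute back. The mathematical equivalence for the shape in my test prototxt can be checked with NumPy:

```python
import numpy as np

def softmax(x, axis):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# same shape as the test prototxt: 1 x 2436 x 4 x 1, softmax over axis 2
x = np.random.rand(1, 2436, 4, 1).astype(np.float32)

# what the Caffe layer computes with softmax_param { axis: 2 }
ref = softmax(x, axis=2)

# workaround: move axis 2 onto the channel axis (axis 1), take a
# channel softmax, then move it back
y = softmax(x.transpose(0, 2, 1, 3), axis=1).transpose(0, 2, 1, 3)

assert np.allclose(ref, y, atol=1e-6)
```

Whether this actually helps performance would depend on how the permutation is implemented in the network (e.g. extra reshape/permute layers have their own cost).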