cuDNN dilated convolution low efficiency

Hi. I’m using cuDNN for dilated convolution.

I call cudnnGetConvolutionForwardAlgorithm() and cudnnGetConvolutionForwardWorkspaceSize(), which return the algorithm CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM and a workspace size of 0.
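For reference, here is roughly what I'm doing, except that instead of the heuristic I've also tried benchmarking every forward algorithm with cudnnFindConvolutionForwardAlgorithm(), which actually times each one on the real shapes. The tensor sizes below (N=1, C=64, 56×56, 3×3 filter, dilation 2) are just placeholders, not my actual network:

```cpp
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // Placeholder shapes: N=1, C=64, H=W=56, 64 3x3 filters.
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               1, 64, 56, 56);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               64, 64, 3, 3);
    // pad=2, stride=1, dilation=2: effective 5x5 kernel, output stays 56x56.
    cudnnSetConvolution2dDescriptor(convDesc, 2, 2, 1, 1, 2, 2,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                          &n, &c, &h, &w);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               n, c, h, w);

    // Exhaustively time all forward algorithms; results come back
    // sorted fastest-first, so perf[0].algo is the best choice here.
    cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
    int returned = 0;
    cudnnFindConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                         CUDNN_CONVOLUTION_FWD_ALGO_COUNT,
                                         &returned, perf);
    for (int i = 0; i < returned; ++i) {
        printf("algo %d: %.3f ms, workspace %zu bytes, status %d\n",
               perf[i].algo, perf[i].time, perf[i].memory, perf[i].status);
    }

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```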

Performance seems low compared to Caffe, which implements dilated convolution via im2col plus a cuBLAS GEMM.

How can I improve the performance of my cuDNN dilated convolution? Or should I switch to a GEMM-based implementation? Thank you.

BTW, I’m using a Titan Xp with CUDA 9.0 and cuDNN 7.4.1.