cuDNN fp16 Support

Based on benchmarks and on the cuobjdump output for the kernels used when half precision compute is enabled in cuDNN, it looks like the kernels included with v4 and with the v5 RC simply convert fp16 data to fp32 and then perform the computation in fp32. Based on this, I have some questions:

  • Am I just doing something wrong – are the convolution routines on Tegra implemented for half precision types?
  • If I’m not doing something wrong, is there a timeline as to when the implementation using half precision compute will be available?

On Tegra, “FP16 math” is supported only for the forward convolution.

To enable it, you need to set the datatype parameter to CUDNN_DATA_HALF when calling cudnnSetConvolutionNdDescriptor or cudnnSetConvolution2dDescriptor_v5.

Of course, the input and output tensors also need to be of datatype CUDNN_DATA_HALF.

If you call cudnnSetConvolutionNdDescriptor with datatype CUDNN_DATA_FLOAT but the tensors are of type CUDNN_DATA_HALF, then the inputs are converted from fp16 to fp32, the math is done in FP32, and the output is converted back to FP16.
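To make the difference between the two modes concrete, here is a minimal numerical sketch in plain Python. It does not call cuDNN; it only emulates the rounding behaviour described above using the standard `struct` module. The helper names (`to_fp16`, `to_fp32`, `dot_pseudo_fp16`, `dot_fp16_math`) are made up for illustration, and the assumption that the FP16 path rounds every intermediate result to half precision is a simplification, not a statement about how the actual kernels are written.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_pseudo_fp16(a, b):
    """FP16 storage, FP32 math: widen inputs, accumulate in fp32, narrow once."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp32(to_fp16(x) * to_fp16(y))  # product kept in fp32
        acc = to_fp32(acc + prod)                # accumulator kept in fp32
    return to_fp16(acc)                          # single conversion back to fp16

def dot_fp16_math(a, b):
    """Assumed model of FP16 math: every intermediate is rounded to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))  # product rounded to fp16
        acc = to_fp16(acc + prod)                # accumulator rounded to fp16
    return acc

ones = [1.0] * 4096
print(dot_pseudo_fp16(ones, ones))  # 4096.0
print(dot_fp16_math(ones, ones))    # 2048.0
```

In this toy case the fp16 accumulator stops growing at 2048, because above that value half precision can only represent even integers and adding 1 rounds back down. The two modes can therefore differ in accuracy as well as in speed.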

Depending on your convolution config, even doing convolution math in FP16 might not bring much speedup.

Both the convolution and tensor descriptors are set to use CUDNN_DATA_HALF. Depending on the convolution config, it looks like it’s calling either:

  • maxwell_fp16_scudnn_fp16_128x64_small_nn
  • maxwell_fp16_scudnn_winograd_fp16_128x128_mobile_tile148t_nt

which, based on the naming, look to me like fp16 functions. However, looking at the results of cuobjdump, these functions are calling F2F and converting the data to a full precision float before doing any computation (unless they defer computation to other functions that I don’t see).
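For anyone wanting to repeat this check, a rough sketch of how the disassembly could be scanned for F2F conversions per kernel is below. It assumes the text dump contains one "Function : name" line per kernel followed by that kernel's instructions. The sample text and the kernel names `kernel_a` and `kernel_b` are hypothetical and heavily simplified; they are not real cuDNN disassembly.

```python
import re

def count_f2f_by_function(sass_text):
    """Count lines containing an F2F instruction under each 'Function :' header."""
    counts = {}
    current = None
    for line in sass_text.splitlines():
        header = re.search(r"Function\s*:\s*(\S+)", line)
        if header:
            current = header.group(1)
            counts[current] = 0
        elif current is not None and re.search(r"\bF2F\b", line):
            counts[current] += 1
    return counts

# Hypothetical, simplified dump used only to exercise the parser.
SAMPLE = """
    Function : kernel_a
        /*0008*/    F2F.F32.F16 R0, R2;
        /*0010*/    FFMA R4, R0, R1, R4;
        /*0018*/    F2F.F16.F32 R5, R4;
    Function : kernel_b
        /*0008*/    FFMA R4, R0, R1, R4;
"""

print(count_f2f_by_function(SAMPLE))  # {'kernel_a': 2, 'kernel_b': 0}
```

A non-zero count only shows that conversions are present in that function. It does not by itself prove where the arithmetic happens, which is the caveat already noted above about work possibly being deferred to other functions.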

Based on the functions cudnn is calling, does it look like I’m specifying anything incorrectly? Is there anything else I could try?

It seems that you are still doing “pseudo-FP16”, i.e. tensors in FP16 but computation in FP32.

How do you call cudnnSetConvolutionNdDescriptor?
The last parameter, cudnnDataType_t computeType, needs to be set to CUDNN_DATA_HALF if you want the computation to be done in FP16. In that case, you should see a kernel with a name containing hcudnn instead of scudnn.
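As a quick way to apply that naming rule to a list of kernel names taken from a profiler, something like the sketch below could be used. The function name `compute_precision_from_kernel_name` is made up here, and the rule it encodes is only the hcudnn/scudnn hint given in this thread, not an official naming specification. The name `example_hcudnn_kernel` is a placeholder, not a real kernel.

```python
def compute_precision_from_kernel_name(name):
    """Guess the compute precision from the hcudnn/scudnn substring in a kernel name."""
    if "hcudnn" in name:
        return "fp16"     # half precision compute, per the hint in this thread
    if "scudnn" in name:
        return "fp32"     # single precision compute, i.e. pseudo-FP16 if tensors are fp16
    return "unknown"

kernels = [
    "maxwell_fp16_scudnn_fp16_128x64_small_nn",
    "maxwell_fp16_scudnn_winograd_fp16_128x128_mobile_tile148t_nt",
    "example_hcudnn_kernel",  # placeholder name, not a real kernel
]
for k in kernels:
    print(k, "->", compute_precision_from_kernel_name(k))
```

Both kernel names reported earlier in the thread contain scudnn, so under this rule they fall in the FP32 compute category, which matches the F2F conversions seen in the disassembly.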