How to benchmark FP16 performance on the TX1?

Hello, I am evaluating the FP16 compute performance of the TX1, but I can't reach the theoretical 1 TFLOPS.
I have tried many methods, and the best I get is 0.86 TFLOPS.
My experimental steps are as follows.
1. Before running the test code, I load the [...] and use tegrastats to monitor the TX1, to make sure the GPU clock is at 998 MHz.
2. The test code is below. It uses matrix multiplication to measure compute performance.

3. I also wrote some simple test programs based on FP16 addition and multiplication, but they did not reach the theoretical 1 TFLOPS either.
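
For readers who want to reproduce step 2, a matrix multiplication test of this kind could look roughly like the sketch below. It is not the code from this post. It assumes cuBLAS is installed and uses cublasHgemm with an arbitrary square size of 2048 and 20 timed repetitions; the file name and build line in the comment are assumptions as well.

```cuda
// Hypothetical FP16 GEMM throughput test; not the code from the original post.
// Assumed build line: nvcc -arch=sm_53 hgemm_bench.cu -o hgemm_bench -lcublas
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cublas_v2.h>

int main() {
    const int n = 2048;     // square matrix dimension (arbitrary choice)
    const int iters = 20;   // timed repetitions (arbitrary choice)
    const size_t bytes = sizeof(__half) * n * n;

    __half *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    // Zero bytes are valid FP16 zeros; the results are not checked here.
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);
    cudaMemset(C, 0, bytes);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        printf("cublasCreate failed\n");
        return 1;
    }
    const __half alpha = __float2half(1.0f);
    const __half beta = __float2half(0.0f);

    // Warm-up call so that one-time initialization is not timed.
    cublasStatus_t st = cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                                    &alpha, A, n, B, n, &beta, C, n);
    if (st != CUBLAS_STATUS_SUCCESS) {
        printf("cublasHgemm failed with status %d\n", (int)st);
        return 1;
    }
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // An n x n x n GEMM performs 2 * n^3 floating-point operations.
    const double flop = 2.0 * n * n * n * iters;
    printf("FP16 GEMM: %.3f TFLOPS\n", flop / (ms * 1e-3) / 1e12);

    cublasDestroy(handle);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
```

The timing uses CUDA events around the whole loop, so the reported number is an average over all repetitions and includes launch overhead.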

What is the cause of this result? Is there any way to reach the theoretical 1 TFLOPS? Thanks.
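
For reference, the 1 TFLOPS figure follows from 256 CUDA cores x 2 FLOPs per fused multiply-add x 2 FP16 values per half2 instruction x 998 MHz, which is about 1.02 TFLOPS. A simple FP16 test like the ones in step 3 could be sketched as below. This is not the original test program: the kernel name, launch sizes, and iteration count are hypothetical choices, and a loop like this is not guaranteed to hit the peak.

```cuda
// Hypothetical half2 FMA microbenchmark; not the original poster's code.
// Assumed build line: nvcc -arch=sm_53 hfma2_bench.cu -o hfma2_bench
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Each thread runs four independent FMA chains to help hide instruction latency.
__global__ void hfma2_loop(__half2 *out, int iters) {
    const __half2 a = __float2half2_rn(0.5f);
    const __half2 b = __float2half2_rn(0.25f);
    __half2 x0 = __float2half2_rn(0.001f * threadIdx.x);
    __half2 x1 = __float2half2_rn(0.002f * threadIdx.x);
    __half2 x2 = __float2half2_rn(0.003f * threadIdx.x);
    __half2 x3 = __float2half2_rn(0.004f * threadIdx.x);
    #pragma unroll 8
    for (int i = 0; i < iters; ++i) {
        x0 = __hfma2(x0, a, b);   // x = x * a + b on both FP16 lanes
        x1 = __hfma2(x1, a, b);
        x2 = __hfma2(x2, a, b);
        x3 = __hfma2(x3, a, b);
    }
    // Store the result so the compiler cannot remove the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] =
        __hadd2(__hadd2(x0, x1), __hadd2(x2, x3));
}

int main() {
    const int blocks = 64, threads = 256;   // arbitrary launch configuration
    const int iters = 1 << 18;              // arbitrary loop length
    __half2 *out;
    cudaMalloc(&out, sizeof(__half2) * blocks * threads);

    hfma2_loop<<<blocks, threads>>>(out, 16);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    hfma2_loop<<<blocks, threads>>>(out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 4 chains per thread; each __hfma2 is 2 lanes x 2 FLOPs = 4 FLOPs.
    const double flop = 4.0 * 4.0 * (double)iters * blocks * threads;
    const double achieved = flop / (ms * 1e-3);
    // Peak assumes 256 CUDA cores and the 998 MHz clock reported in the post.
    const double peak = 256.0 * 2.0 * 2.0 * 998e6;
    printf("achieved %.3f TFLOPS, %.1f%% of %.3f TFLOPS peak\n",
           achieved / 1e12, 100.0 * achieved / peak, peak / 1e12);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```

By the same arithmetic, the 0.86 TFLOPS measured above is roughly 84% of the computed peak.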


Please test it with the conv_sample in /usr/src/cudnn_samples_v7/conv_sample.