Benchmark for cuDNN, CUDA C, and TensorRT layer latency on a Tesla GPU?

For online neural network inference, per-layer latency limits how many concurrent server processes can share one GPU. We built our inference program from handwritten CUDA C kernels and cuBLAS calls, using the float data type, and I am now trying to optimize it. Has anyone compared per-layer latency between cuDNN, CUDA C, and TensorRT on a Tesla GPU with float data?
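
For reference, here is roughly how we time a single layer, a minimal sketch using CUDA events (the kernel, the SGEMV sizes, and the iteration count are placeholders, not our actual network):

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Placeholder handwritten layer kernel (bias add + ReLU).
__global__ void biasRelu(float* x, const float* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i] + b[i];
        x[i] = v > 0.f ? v : 0.f;
    }
}

int main()
{
    const int M = 1024, K = 1024;            // one FC layer, batch 1
    float *A, *x, *y, *b;
    // Contents are left uninitialized; only the timing matters here.
    cudaMalloc(&A, M * K * sizeof(float));
    cudaMalloc(&x, K * sizeof(float));
    cudaMalloc(&y, M * sizeof(float));
    cudaMalloc(&b, M * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.f, beta = 0.f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up run so context creation and kernel load aren't counted.
    cublasSgemv(handle, CUBLAS_OP_N, M, K, &alpha, A, M, x, 1, &beta, y, 1);
    biasRelu<<<(M + 255) / 256, 256>>>(y, b, M);

    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        cublasSgemv(handle, CUBLAS_OP_N, M, K, &alpha, A, M, x, 1, &beta, y, 1);
        biasRelu<<<(M + 255) / 256, 256>>>(y, b, M);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg layer latency: %.3f ms\n", ms / iters);

    cublasDestroy(handle);
    return 0;
}
```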
Thanks.

TensorRT should give you good overall latency and throughput. Note, however, that it takes a whole graph of layers as input and internally performs layer fusion and other optimizations, so its latency is most meaningfully measured per network rather than per layer.

The 1.0 release (and the 2.0 EA) available on the web don't include support for custom layers, but that support is coming in our next public release. So your handwritten kernels will be easiest to plug in with the forthcoming version.

In the meantime, you can try TensorRT 1.0 with a standard network (the samples show how to download and benchmark GoogLeNet and similar networks) and see what performance and latency benefits it provides.
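
If it helps, once an engine is built (sampleGoogleNet shows the build steps), whole-network latency can be measured the same way with CUDA events. This is only a sketch assuming the TensorRT 1.0 C++ API, where inference is the synchronous IExecutionContext::execute(batchSize, buffers) call; engine construction and device buffer allocation are omitted:

```cpp
#include <cuda_runtime.h>
#include "NvInfer.h"

// Assumes 'context' came from engine->createExecutionContext() and
// 'buffers' holds device pointers ordered by engine binding index,
// as set up in the shipped sampleGoogleNet.
float timeInference(nvinfer1::IExecutionContext& context,
                    void** buffers, int batchSize, int iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    context.execute(batchSize, buffers);   // warm-up run, not timed

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        context.execute(batchSize, buffers);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iterations;                // average latency per batch
}
```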