How to check layer precision?

Even if we ask for an engine with fp16 or int8 precision, TensorRT is free to use a higher precision if the higher-precision layers are faster (unless strict type constraints are enforced). Is there a way to know which layers run in fp32/fp16/int8 after building the engine?

I tried layer->getPrecision(), but I always get fp32, even when I ask for the engine to be built in fp16 or int8. Note that when building in fp16 or int8, the serialized engine is smaller than the fp32 engine, so I think TensorRT has selected at least some lower-precision weights/kernels for inference. Inference is also somewhat faster. Yet the network still reports fp32 for every layer?

The network is parsed from an ONNX file.

A minimal example:

auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger_), TrtDeleter());

// int8 with fp16 fallback
builder->setInt8Mode(true);
builder->setFp16Mode(true);
 
auto network = std::shared_ptr<nvinfer1::INetworkDefinition>(builder->createNetwork(), TrtDeleter());
auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger_), TrtDeleter());
int severity = static_cast<int>(nvinfer1::ILogger::Severity::kWARNING);
parser->parseFromFile(fn_onnx.c_str(), severity);
auto engine = std::shared_ptr<nvinfer1::ICudaEngine>(builder->buildCudaEngine(*network), TrtDeleter());

int const num_layers = network->getNbLayers();
for (int ii = 0; ii < num_layers; ++ii) {
  auto layer = network->getLayer(ii);
  nvinfer1::DataType precision = layer->getPrecision();  // this is always nvinfer1::DataType::kFLOAT
}

Hello,

Per engineering: the network layers will always report FP32. The engine layers are the ones that will have different precisions. At the moment, users need to run nvprof to see which kernels are executed and infer the precision of the layers from there.

Thank you for your reply.

nvprof does indeed provide the names of the kernels that are run. However, the kernel names are sometimes a bit cryptic. For example, running ResNet50 in int8 gives
https://drive.google.com/open?id=1DYVnoZIyZrM-RGxSZj06cDqAywCXkeir

and fp16 gives
https://drive.google.com/open?id=1kTHaGy6qYLJDHI2PdRBmRY22u1mzNvwy

Some kernels clearly indicate int8 in their names. But for fp16, many kernels have h884 in their names. From https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#best-practices, I’ve read that gemm kernels with h884 (such as volta_h884gemm_64x64_ldg8_tn) are fp16 kernels run on Tensor Cores. Are kernels with h884_cudnn also fp16 kernels that run on Tensor Cores?
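
To make the name-based guessing above concrete, here is a rough sketch of how the kernel names from an nvprof run could be bucketed. The function name guessPrecision and the two substring rules ("int8" and "h884") are my own assumptions based only on the names mentioned in this thread, not an official naming scheme, so the result is a hint at best.

```cpp
#include <string>

// Heuristic sketch: guess a kernel's precision from substrings in its name.
// The substring rules are assumptions taken from the kernel names discussed
// in this thread, not a documented naming scheme. Treat the result as a hint.
std::string guessPrecision(const std::string& kernelName) {
    auto contains = [&kernelName](const char* needle) {
        return kernelName.find(needle) != std::string::npos;
    };
    if (contains("int8")) return "int8";                 // name mentions int8 explicitly
    if (contains("h884")) return "fp16 (Tensor Cores)";  // h884 marker, per the linked TF-TRT doc
    return "unknown";                                    // no marker: do not assume fp32
}
```

With this, guessPrecision("volta_h884gemm_64x64_ldg8_tn") returns "fp16 (Tensor Cores)". A name with neither marker comes back as "unknown" rather than fp32, since the absence of a marker does not prove anything about the precision.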

I’m also confused because Activation layers do not support int8 according to https://docs.nvidia.com/deeplearning/sdk/tensorrt-support-matrix/index.html , yet the int8 nvprof profile shows relu kernels with int8 in their names?