Even when we ask to build an engine with fp16 or int8 precision, TensorRT is free to fall back to higher precision for a layer if the higher-precision kernel is faster (unless strict type constraints are enforced). Is there a way to know which layers actually run in fp32/fp16/int8 after the engine has been built?
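By strict constraints I mean something like the sketch below (it uses the same pre-7.x builder API as the code further down; setStrictTypeConstraints on IBuilder and setPrecision on ILayer are what I assume to be the relevant calls):

builder->setFp16Mode(true);
builder->setStrictTypeConstraints(true);  // forbid silent fallback to other precisions
network->getLayer(0)->setPrecision(nvinfer1::DataType::kHALF);  // optionally pin one layer explicitly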
I tried layer->getPrecision(), but it always returns fp32, even when I ask to build the engine in fp16 or int8. Note that when the engine is built in fp16 or int8, the serialized engine is smaller than the fp32 one and inference is somewhat faster, so TensorRT has apparently selected at least some lower-precision weights/kernels. Yet the network still reports fp32 for every layer?
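(I compared engine sizes roughly like this, using the engine built by the minimal example below; engine->serialize() and IHostMemory::size() are the calls I assume are relevant:)

auto serialized = std::shared_ptr<nvinfer1::IHostMemory>(engine->serialize(), TrtDeleter());
std::cout << serialized->size() << std::endl;  // smaller for the fp16/int8 builds than for fp32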
The network is parsed from an ONNX file.
A minimal example:
// TrtDeleter is a small functor that calls ->destroy() on TensorRT objects
auto builder = std::unique_ptr<nvinfer1::IBuilder, TrtDeleter>(nvinfer1::createInferBuilder(logger_));
// int8 with fp16 fallback
builder->setInt8Mode(true);
builder->setFp16Mode(true);
auto network = std::shared_ptr<nvinfer1::INetworkDefinition>(builder->createNetwork(), TrtDeleter());
auto parser = std::unique_ptr<nvonnxparser::IParser, TrtDeleter>(nvonnxparser::createParser(*network, logger_));
int severity = static_cast<int>(nvinfer1::ILogger::Severity::kWARNING);
parser->parseFromFile(fn_onnx.c_str(), severity);
auto engine = std::shared_ptr<nvinfer1::ICudaEngine>(builder->buildCudaEngine(*network), TrtDeleter());
int const num_layers = network->getNbLayers();
for (int ii = 0; ii < num_layers; ++ii) {
  auto layer = network->getLayer(ii);
  nvinfer1::DataType precision = layer->getPrecision();  // this is always nvinfer1::DataType::kFLOAT
}
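If it matters: my (possibly wrong) understanding is that getPrecision() only reflects a precision explicitly requested via setPrecision(), not the precision the builder actually picked. A quick way to see whether any per-layer precision was ever requested, assuming precisionIsSet() reports exactly that:

for (int ii = 0; ii < num_layers; ++ii) {
  auto layer = network->getLayer(ii);
  bool const requested = layer->precisionIsSet();  // true only if setPrecision() was called for this layer
  nvinfer1::DataType precision = layer->getPrecision();
}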