Want to know more about INT8 precision

Hi all, I want to know the following details about what happens when we configure the --int8 option during trtexec invocation on the command line.

  1. I have the following clarifications w.r.t. the above option. Is it:
    a. weight-only quantization?
    b. activation-only quantization?
    c. dynamic quantization? (where the quantization ranges for both weights and activations are computed dynamically during inference, as opposed to being fixed)
    d. hybrid quantization? (where some parts of the model are weight-only quantized and other parts are activation-only quantized)
    e. post-training quantization? (where there is a trade-off between model size, inference speed and model accuracy)
  2. There is an option to provide a calibration cache file on the trtexec command line, --calib=. I have the following clarifications about this:
    a. How does it work in combination with the --int8 option?
    b. How does this file need to be generated when we already have a pre-trained model?
    c. What if I don't give this option but only specify the --int8 option?
    d. Can the calibration cache generated for one model be used for inferencing other models too?
  3. There are some sample codes related to INT8 precision under the directory /usr/src/tensorrt/, along with the source file for the trtexec binary. I tried reading them with respect to all of the above clarifications but could not understand them properly.
  4. I wanted clarification about which type of quantization the sample codes in the /usr/src/tensorrt directory are doing, with respect to the types of quantization listed above.
    Please provide me clarifications on all the questions I have raised, along with the API information or the file names of the sample code. Also, please explain how the trade-off between model size, inference speed and model accuracy is made in the sample code when we specify the --int8 option.
    I will definitely benefit from it.

Thanks and Regards

Nagaraj Trivedi

Dear @trivedi.nagaraj,
When you set only the --int8 flag, the dynamic range is, by default, set for all layers with dummy values.
But if the --calib option (calibration cache file) is used along with --int8, the calibrated data is used to fill the scales.
You can generate the calibration cache using some test data and different calibrators (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation).
As each model has a different set of layers and a different architecture, the calibration cache generated for one model cannot be used for other models.
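
For illustration, a calibration cache is typically produced by implementing one of the calibrator interfaces and handing it to the builder. The following is only a minimal sketch assuming a TensorRT 8.x-style C++ API; the class name MyEntropyCalibrator and the loadNextBatch() data-loading stub are hypothetical placeholders, not part of trtexec or the shipped samples.

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical calibrator: feeds calibration batches to TensorRT and caches the
// resulting scales so later builds can skip the calibration runs.
class MyEntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    MyEntropyCalibrator(int32_t batches, int32_t batchSize, size_t inputVolume, const std::string& cacheFile)
        : mBatches(batches), mBatchSize(batchSize), mInputVolume(inputVolume), mCacheFile(cacheFile)
    {
        cudaMalloc(&mDeviceInput, static_cast<size_t>(mBatchSize) * mInputVolume * sizeof(float));
    }
    ~MyEntropyCalibrator() override { cudaFree(mDeviceInput); }

    int32_t getBatchSize() const noexcept override { return mBatchSize; }

    bool getBatch(void* bindings[], const char* names[], int32_t nbBindings) noexcept override
    {
        // A single network input is assumed here; bindings[0] corresponds to names[0].
        std::vector<float> hostBatch;
        if (!loadNextBatch(hostBatch))
            return false; // no more calibration data
        cudaMemcpy(mDeviceInput, hostBatch.data(), hostBatch.size() * sizeof(float), cudaMemcpyHostToDevice);
        bindings[0] = mDeviceInput;
        return true;
    }

    const void* readCalibrationCache(size_t& length) noexcept override
    {
        // If a cache file already exists, TensorRT reuses it and skips the calibration runs.
        std::ifstream input(mCacheFile, std::ios::binary);
        mCache.assign(std::istreambuf_iterator<char>(input), std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    void writeCalibrationCache(const void* cache, size_t length) noexcept override
    {
        // This is the file a later trtexec run can consume via --calib=<file>.
        std::ofstream output(mCacheFile, std::ios::binary);
        output.write(static_cast<const char*>(cache), length);
    }

private:
    // Placeholder data loader: a real calibrator would fill the batch with
    // preprocessed test images; this stub stops after mBatches empty batches.
    bool loadNextBatch(std::vector<float>& batch)
    {
        if (mCurrentBatch++ >= mBatches)
            return false;
        batch.assign(static_cast<size_t>(mBatchSize) * mInputVolume, 0.0F);
        return true;
    }

    int32_t mBatches;
    int32_t mBatchSize;
    size_t mInputVolume;
    std::string mCacheFile;
    int32_t mCurrentBatch{0};
    void* mDeviceInput{nullptr};
    std::vector<char> mCache;
};

The cache file written by writeCalibrationCache() is what --calib= points at, so --int8 together with --calib gives calibrated scales instead of dummy ranges.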

Thank you SivaRamaKrishnan for the clarification.
One point you mentioned in your reply is that when the --int8 flag is set, the dynamic range is set for all layers with dummy values. May I know where in the trtexec source code the setting of the dynamic range is handled (file name and function name), so that I can analyze it in detail? I have seen the trtexec source code among the samples but am unable to locate it. If you can point me to it, that will help me a lot. I have also gone through the documentation, but a few things are not clear in it, particularly with regard to:

  1. When we configure the --int8 flag, what should be the precision of the test image (tensor)? Should it still be FP32 or FP16, or must it be converted to INT8?
  2. May I get a sample code that sets the dynamic range, as you have stated in the reply?

Please clarify these doubts for me.

Thanks and Regards

Nagaraj Trivedi

Hi SivaRamaKrishnan, please update me on this.

Thanks and Regards

Nagaraj Trivedi

Dear @trivedi.nagaraj,
The input data will be FP32 even though we set the precision to INT8.
You can check the setting of the dynamic range (setTensorDynamicRange) in tensorrt/samples/common/sampleEngines.cpp.
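
For reference, the pattern used there is roughly the following. This is a minimal sketch of the same idea rather than a copy of the sample, and the dummy range value maxAbs is illustrative only; the idea is that every tensor without a range gets a symmetric dummy range via ITensor::setDynamicRange.

#include <NvInfer.h>

// Assign a symmetric dummy dynamic range [-maxAbs, +maxAbs] to every tensor
// that does not already have one, so the builder can choose INT8 kernels even
// without a calibration cache. Accuracy is meaningless with dummy ranges;
// they only allow performance experiments.
void setDummyDynamicRanges(nvinfer1::INetworkDefinition& network, float maxAbs)
{
    // Network inputs (these stay FP32 buffers at runtime; the range only
    // tells TensorRT how to quantize them internally).
    for (int i = 0; i < network.getNbInputs(); ++i)
    {
        nvinfer1::ITensor* input = network.getInput(i);
        if (!input->dynamicRangeIsSet())
            input->setDynamicRange(-maxAbs, maxAbs);
    }
    // Outputs of every layer, i.e. the activations flowing between layers.
    for (int i = 0; i < network.getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network.getLayer(i);
        for (int j = 0; j < layer->getNbOutputs(); ++j)
        {
            nvinfer1::ITensor* output = layer->getOutput(j);
            if (!output->dynamicRangeIsSet())
                output->setDynamicRange(-maxAbs, maxAbs);
        }
    }
}

trtexec takes this dummy-range path only when --int8 is given without --calib; the resulting engine is useful for performance measurements but not for accuracy, since the ranges do not reflect real data.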

Hi SivaRamaKrishnan, thank you for providing me this information.
I have studied this file, particularly the parts related to INT8 calibration.
But in the code I have not found how the FP32 weights are converted to INT8 precision, or where and how that conversion is made use of during inference. I have pasted the API/method below and summarised what it does. After that summary I list my questions regarding this API/method.

  1. This API/method (the RndInt8Calibrator constructor) loops through all the inputs of the network, and for each of them it does the following.
  2. Identifies its dimensions and generates random numbers (weights) for the identified dimensions. These values are in FP32 format.
  3. Performs cudaMalloc and cudaMemcpy to store these randomly generated values in the pointer data.
  4. Makes a pair of the input tensor's name and the data and inserts it into mInputDeviceBuffers using mInputDeviceBuffers.insert().

I have these questions:
1. The random values generated are of FP32 type, as are the model's weights. How does it convert them to INT8 during inference?
2. It works even when only the --int8 option is supplied and not the calibration cache file. How is that possible?
3. During inference, from where does it get the INT8 weights so that the weight size of the model being inferenced is reduced? When you provide the clarification, please also point me to the file(s) and API(s) where the answers to my questions can be found. Below is the code I have pasted.

RndInt8Calibrator::RndInt8Calibrator(int batches, std::vector<int64_t>& elemCount, const std::string& cacheFile,
    const INetworkDefinition& network, std::ostream& err)
    : mBatches(batches)
    , mCurrentBatch(0)
    , mCacheFile(cacheFile)
    , mErr(err)
{
    // If a calibration cache already exists, there is no need to generate
    // random calibration data; the cached scales will be read instead.
    std::ifstream tryCache(cacheFile, std::ios::binary);
    if (tryCache.good())
    {
        return;
    }

    // Uniform random generator producing dummy calibration inputs in [-1, 1].
    std::default_random_engine generator;
    std::uniform_real_distribution<float> distribution(-1.0F, 1.0F);
    auto gen = [&generator, &distribution]() { return distribution(generator); };

    // For every network input, fill a device buffer with random FP32 data
    // and register it under that input tensor's name.
    for (int i = 0; i < network.getNbInputs(); i++)
    {
        auto* input = network.getInput(i);
        std::vector<float> rnd_data(elemCount[i]);
        std::generate_n(rnd_data.begin(), elemCount[i], gen);

        void* data;
        cudaCheck(cudaMalloc(&data, elemCount[i] * sizeof(float)), mErr);
        cudaCheck(cudaMemcpy(data, rnd_data.data(), elemCount[i] * sizeof(float), cudaMemcpyHostToDevice), mErr);

        mInputDeviceBuffers.insert(std::make_pair(input->getName(), data));
    }
}

Dear @trivedi.nagaraj,
When you set the precision builder flag, the TensorRT engine-build module takes care of the weight precision conversion and of selecting the optimal kernel implementation for each layer during inference. All of this is handled by the TensorRT framework; developers just have to set the precision flag.
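
For illustration, the build-time usage being described looks roughly like the following. This is a minimal sketch assuming a TensorRT 8.x C++ API; the function name buildInt8Engine is hypothetical, and the builder, network and calibrator objects are assumed to be created elsewhere.

#include <NvInfer.h>

// Sketch: given an already-populated INetworkDefinition and (optionally) a
// calibrator, request INT8 precision and let the builder convert the FP32
// weights and pick INT8 kernels where they are supported and faster.
nvinfer1::IHostMemory* buildInt8Engine(nvinfer1::IBuilder& builder,
                                       nvinfer1::INetworkDefinition& network,
                                       nvinfer1::IInt8Calibrator* calibrator)
{
    nvinfer1::IBuilderConfig* config = builder.createBuilderConfig();

    // Equivalent of trtexec --int8: allow INT8 kernels. FP32 remains allowed,
    // so the builder can fall back where INT8 is unsupported or slower.
    config->setFlag(nvinfer1::BuilderFlag::kINT8);

    if (calibrator != nullptr)
    {
        // Equivalent of trtexec --calib=<file>: use calibrated scales.
        config->setInt8Calibrator(calibrator);
    }
    // Without a calibrator, per-tensor dynamic ranges must be set manually,
    // as trtexec does with dummy values when only --int8 is given.

    // The precision conversion happens inside the build call; layers chosen to
    // run in INT8 store their weights in INT8 inside the serialized engine.
    nvinfer1::IHostMemory* serializedEngine = builder.buildSerializedNetwork(network, *config);
    delete config;
    return serializedEngine;
}

The reduction in weight size therefore comes from the engine that the builder serializes, which is why the conversion is not visible in the calibrator code you pasted above.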

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.