TensorRT 2 INT8 samples

I’m testing the performance of TensorRT 2 using INT8 inference on P4.

A sample_int8.cpp is provided with the TensorRT 2 installation, which can be used to test the performance of networks including GoogLeNet, VGG and ResNet. The flow of the sample is basically to read batches of data for calibration first, then read further batches for INT8 inference and calculate Top1 and Top5 accuracy scores.

The problem is that there is little documentation on the data structure to be fed to the network for RGB images, which are used for classification in GoogLeNet, VGG and ResNet. The MNIST sample reads grayscale images, and the GoogLeNet sample (a separate sample file) reads dummy data.

From the source code, it can be inferred that the image data should be written in binary files named “batch0”, “batch1”, … under a “batches” directory.

Is there any example for preparing RGB images for use in the sample_int8.cpp? Any help is appreciated.

Also, regarding inference performance: according to https://developer.nvidia.com/tensorrt, the P40 card achieves INT8 inference on GoogLeNet at about 6500 fps. Since the INT8 TOPS of the P4 is approximately half that of the P40, the expected inference performance should be around 3000 fps.

But when running sample_int8.cpp, the performance obtained is less than 1000 fps. Is there anything I am missing? I followed the installation guide and successfully installed CUDA 8.0 and TensorRT 2 on Ubuntu 14.04.

QinglinTian,

Good questions.

RGB input data: We’ve discussed internally how deeply to document the process of preparing incoming images before feeding them into the first layer of a neural network. I don’t think any of the samples that ship with TensorRT cover it. I’ve emailed some of my colleagues to see if any of them have suggestions.

P40 perf: What batch size are you using? I am looking at our internal perf numbers and the batch size will matter a fair amount. Try 128 images / batch.

I’d like to know what you find.

QinglinTian,

Some of my colleagues chimed in to point me to a few samples you might want to check out that do RGB processing.

https://github.com/dusty-nv/jetson-inference/blob/master/imageNet.cu
This converts RGB to the planar BGR format that AlexNet and GoogLeNet derivatives use, and applies mean-value subtraction. It is a CUDA function called by the network primitive classes before invoking TensorRT.

The GRE inference demo does this as well. Specifically, see the preprocess subroutine in https://github.com/NVIDIA/gpu-rest-engine/blob/master/inference/classification.cpp

Hope this helps!

-Chris

Update: it turns out that my P4 card has a hardware defect.

============================================================

I managed to get reasonable accuracy results using VGG16.

Another question that arises when using the VGG16 network: in sampleINT8.cpp there is a calibration table defining cutoff and quantileIndex values for different networks, but VGG16 is missing. What is the recommended calibration parameter setting for the VGG16 network, and how are these values determined?

Regarding performance, the stats in the log indicate that the time to process one image is slightly longer than 1 ms, so the frame rate is less than 1000 fps at batch size 100. I also tried batch size 128, with regenerated “batch0, batch1, …, batch9” files and the corresponding configuration changes in sample_int8.cpp, but the performance is similar, still under 1000 fps.

Also, I noticed that the accuracy loss reported for INT8 compared to FP32 is relatively low, while what I’m getting on AlexNet and VGG is noticeably higher.

Can you provide more details on the evaluation of accuracy?

I can’t find an ‘int8’ folder in my gie_samples. Can you tell me where you got it?

Are you using TensorRT version 2?

Yes.

I’m not sure where the problem is. I don’t have a gie_samples dir, only a tensorrt dir.

Hello QinglinTian,

I have tested ./sample_int8 on a 1080 Ti and got the accuracy below. Is this reasonable?
:~/no_backup/d1230/TensorRT-2.1.2/bin> ./sample_int8 mnist

INT8 run:400 batches of size 100 starting at 100

Top1: 0.9918, Top5: 1
Processing 40000 images averaged 0.00144474 ms/image and 0.144474 ms/batch.

FP32 run:400 batches of size 100 starting at 100

Top1: 0.9918, Top5: 1
Processing 40000 images averaged 0.00220669 ms/image and 0.220669 ms/batch.

I would like to test the same INT8 inference on GoogLeNet, ResNet and VGG.
Can you please share what to change in sampleINT8.cpp, how to generate the batches, etc.?
It would be great if you could share everything required to run INT8 on GoogLeNet, VGG and ResNet.

Thanks a lot !!!