Hi, I am working with a TX2 and I want to use TensorRT to accelerate the inference time of YOLOv3-lite on the TX2. But when I feed in a 320x320 image, the inference time for FP32 and FP16 is unstable: over many single-image tests, FP32 took about 70 ms to 200 ms, and FP16 also took about 70 ms to 200 ms. What could be causing this? Thanks!
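For reference, unstable single-image timings are often easier to interpret after a few warm-up iterations and when averaged over many runs; below is a minimal timing sketch, where runInference() is only a hypothetical stand-in for one forward pass on a single 320x320 image.

#include <chrono>
#include <iostream>

// Hypothetical stand-in for one forward pass (preprocess + inference + postprocess).
void runInference()
{
    // ... run the network on one image ...
}

int main()
{
    const int warmupRuns = 10;
    const int timedRuns  = 100;

    // Discard the first iterations: lazy allocations and GPU clock ramp-up
    // can make a single measurement look unstable.
    for (int i = 0; i < warmupRuns; ++i)
        runInference();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < timedRuns; ++i)
        runInference();
    auto t1 = std::chrono::high_resolution_clock::now();

    float avgMs = std::chrono::duration<float, std::milli>(t1 - t0).count() / timedRuns;
    std::cout << "average latency over " << timedRuns << " runs: " << avgMs << " ms" << std::endl;
    return 0;
}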
There is another problem when I install TensorRT on my computer from the tar package. I can install the .whl for Python, but “which tensorrt” finds nothing, which means there is no such binary in /usr/lib/bin…
Regarding the “which tensorrt” question: there is no binary called tensorrt. One way to verify the TensorRT installation is to start Python and run
import tensorrt as trt
Regarding the inference time, can you please share a small repro that contains your inference code, model, and sample inference images that demonstrate the performance?
Also, what versions of TensorRT and TensorFlow (if applicable) are you using?
Hi, NVES! I cannot import tensorrt either. The TRT version I want to install on my computer is 4.0.1.6.
Hi, the TRT version on my TX2 is 4 (JetPack 3.3), and I used a caffemodel and prototxt to generate the inference engine.
In response to comment #5, can you please list the installation steps you used and the specific errors/messages you are seeing? To avoid dependency issues, you may also want to consider the NVIDIA GPU Cloud nvcr.io/nvidia/tensorrt:18.12-py3 container. An https://www.nvidia.com/en-us/gpu-cloud/ account is free.
In response to comment #6, can you please share a small repro that contains your inference code, model, and sample inference images that demonstrate the performance? Screenshots won’t help us debug.
regards,
NVIDIA Enterprise Support
namespace Tn
{
    trtNet::trtNet(const std::string& prototxt, const std::string& caffemodel, const std::vector<std::string>& outputNodesName,
                   const std::vector<std::vector<float>>& calibratorData, RUN_MODE mode /*= RUN_MODE::FLOAT32*/)
        : mTrtContext(nullptr), mTrtEngine(nullptr), mTrtRunTime(nullptr), mTrtRunMode(mode), mTrtInputCount(0), mTrtIterationTime(0)
    {
        std::cout << "init plugin proto: " << prototxt << " caffemodel: " << caffemodel << std::endl;

        auto parser = createCaffeParser();
        const int maxBatchSize = 1;
        IHostMemory* trtModelStream{nullptr};

        Int8EntropyCalibrator* calibrator = nullptr;
        if (calibratorData.size() > 0)
        {
            auto endPos = prototxt.find_last_of(".");
            auto beginPos = prototxt.find_last_of('/') + 1;
            std::string calibratorName = prototxt.substr(beginPos, endPos - beginPos);
            std::cout << "create calibrator,Named:" << calibratorName << std::endl;
            calibrator = new Int8EntropyCalibrator(maxBatchSize, calibratorData, calibratorName);
        }

        PluginFactory pluginFactorySerialize;
        ICudaEngine* tmpEngine = loadModelAndCreateEngine(prototxt.c_str(), caffemodel.c_str(), maxBatchSize, parser,
                                                          &pluginFactorySerialize, calibrator, trtModelStream, outputNodesName);
        assert(tmpEngine != nullptr);
        assert(trtModelStream != nullptr);

        if (calibrator)
        {
            delete calibrator;
            calibrator = nullptr;
        }

        tmpEngine->destroy();
        pluginFactorySerialize.destroyPlugin();

        // Deserialize the engine.
        mTrtRunTime = createInferRuntime(gLogger);
        assert(mTrtRunTime != nullptr);
        mTrtEngine = mTrtRunTime->deserializeCudaEngine(trtModelStream->data(), trtModelStream->size(), &mTrtPluginFactory);
        assert(mTrtEngine != nullptr);
        trtModelStream->destroy();

        InitEngine();
    }
    trtNet::trtNet(const std::string& engineFile)
        : mTrtContext(nullptr), mTrtEngine(nullptr), mTrtRunTime(nullptr), mTrtRunMode(RUN_MODE::FLOAT32), mTrtInputCount(0), mTrtIterationTime(0)
    {
        using namespace std;
        fstream file;

        file.open(engineFile, ios::binary | ios::in);
        if (!file.is_open())
        {
            cout << "read engine file" << engineFile << " failed" << endl;
            return;
        }
        file.seekg(0, ios::end);
        int length = file.tellg();
        file.seekg(0, ios::beg);
        std::unique_ptr<char[]> data(new char[length]);
        file.read(data.get(), length);
        file.close();

        std::cout << "*** deserializing" << std::endl;
        mTrtRunTime = createInferRuntime(gLogger);
        assert(mTrtRunTime != nullptr);
        mTrtEngine = mTrtRunTime->deserializeCudaEngine(data.get(), length, &mTrtPluginFactory);
        assert(mTrtEngine != nullptr);

        InitEngine();
    }

    void trtNet::InitEngine()
    {
        const int maxBatchSize = 1;
        mTrtContext = mTrtEngine->createExecutionContext();
        assert(mTrtContext != nullptr);
        mTrtContext->setProfiler(&mTrtProfiler);

        // Input and output buffer pointers that we pass to the engine - the engine requires exactly IEngine::getNbBindings()
        int nbBindings = mTrtEngine->getNbBindings();
        mTrtCudaBuffer.resize(nbBindings);
        mTrtBindBufferSize.resize(nbBindings);
        for (int i = 0; i < nbBindings; ++i)
        {
            Dims dims = mTrtEngine->getBindingDimensions(i);
            DataType dtype = mTrtEngine->getBindingDataType(i);
            int64_t totalSize = volume(dims) * maxBatchSize * getElementSize(dtype);
            mTrtBindBufferSize[i] = totalSize;
            mTrtCudaBuffer[i] = safeCudaMalloc(totalSize);
            if (mTrtEngine->bindingIsInput(i))
                mTrtInputCount++;
        }

        CUDA_CHECK(cudaStreamCreate(&mTrtCudaStream));
    }
    nvinfer1::ICudaEngine* trtNet::loadModelAndCreateEngine(const char* deployFile, const char* modelFile, int maxBatchSize,
                                                            ICaffeParser* parser, nvcaffeparser1::IPluginFactory* pluginFactory,
                                                            IInt8Calibrator* calibrator, IHostMemory*& trtModelStream, const std::vector<std::string>& outputNodesName)
    {
        // Create the builder
        IBuilder* builder = createInferBuilder(gLogger);

        // Parse the model to populate the network, then set the outputs.
        INetworkDefinition* network = builder->createNetwork();
        parser->setPluginFactory(pluginFactory);

        std::cout << "Begin parsing model..." << std::endl;
        const IBlobNameToTensor* blobNameToTensor = parser->parse(deployFile, modelFile, *network, nvinfer1::DataType::kFLOAT);
        if (!blobNameToTensor)
            RETURN_AND_LOG(nullptr, ERROR, "Fail to parse");
        std::cout << "End parsing model..." << std::endl;

        // specify which tensors are outputs
        for (auto& name : outputNodesName)
        {
            auto output = blobNameToTensor->find(name.c_str());
            assert(output != nullptr);
            network->markOutput(*output);
        }

        // Build the engine.
        builder->setMaxBatchSize(maxBatchSize);
        builder->setMaxWorkspaceSize(1 << 30); // 1G

        if (mTrtRunMode == RUN_MODE::INT8)
        {
            std::cout << "setInt8Mode" << std::endl;
            if (!builder->platformHasFastInt8())
                std::cout << "Notice: the platform do not has fast for int8" << std::endl;
            builder->setInt8Mode(true);
            builder->setInt8Calibrator(calibrator);
        }
        else if (mTrtRunMode == RUN_MODE::FLOAT16)
        {
            std::cout << "setFp16Mode" << std::endl;
            if (!builder->platformHasFastFp16())
                std::cout << "Notice: the platform do not has fast for fp16" << std::endl;
            builder->setFp16Mode(true);
        }

        std::cout << "Begin building engine..." << std::endl;
        ICudaEngine* engine = builder->buildCudaEngine(*network);
        if (!engine)
            RETURN_AND_LOG(nullptr, ERROR, "Unable to create engine");
        std::cout << "End building engine..." << std::endl;

        // We don't need the network any more, and we can destroy the parser.
        network->destroy();
        parser->destroy();

        // Serialize the engine, then close everything down.
        trtModelStream = engine->serialize();

        builder->destroy();
        shutdownProtobufLibrary();
        return engine;
    }
    void trtNet::doInference(const void* inputData, void* outputData)
    {
        static const int batchSize = 1;
        assert(mTrtInputCount == 1);

        // DMA the input to the GPU, execute the batch asynchronously, and DMA it back:
        int inputIndex = 0;
        CUDA_CHECK(cudaMemcpyAsync(mTrtCudaBuffer[inputIndex], inputData, mTrtBindBufferSize[inputIndex], cudaMemcpyHostToDevice, mTrtCudaStream));

        auto t_start = std::chrono::high_resolution_clock::now();
        mTrtContext->execute(batchSize, &mTrtCudaBuffer[inputIndex]);
        auto t_end = std::chrono::high_resolution_clock::now();
        float total = std::chrono::duration<float, std::milli>(t_end - t_start).count();
        std::cout << "Time taken for inference is " << total << " ms." << std::endl;

        for (size_t bindingIdx = mTrtInputCount; bindingIdx < mTrtBindBufferSize.size(); ++bindingIdx)
        {
            auto size = mTrtBindBufferSize[bindingIdx];
            CUDA_CHECK(cudaMemcpyAsync(outputData, mTrtCudaBuffer[bindingIdx], size, cudaMemcpyDeviceToHost, mTrtCudaStream));
            outputData = (char*)outputData + size;
        }

        mTrtIterationTime++;
    }
}
This is the code.
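For context, here is a minimal usage sketch of the trtNet class above. The header name, file paths, output node names, and the output buffer size are all placeholders, and it assumes RUN_MODE is declared in the Tn namespace; the real output size has to match the sum of the network's output binding sizes.

#include <string>
#include <vector>
#include "TrtNet.h"   // assumed header declaring Tn::trtNet and RUN_MODE

int main()
{
    // Build an FP16 engine from the Caffe files (placeholder paths and output blob names).
    Tn::trtNet net("yolov3-lite.prototxt", "yolov3-lite.caffemodel",
                   {"conv22_1cls", "conv23_1cls", "conv24_1cls"},   // placeholder output node names
                   {},                                              // empty calibration data, so no INT8 calibrator is created
                   Tn::RUN_MODE::FLOAT16);

    std::vector<float> input(3 * 320 * 320);   // one preprocessed 320x320 image
    std::vector<float> output(1000000);        // placeholder size; must cover all output bindings

    net.doInference(input.data(), output.data());
    return 0;
}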
Hi, NVES! I found that there is no acceleration of the deconvolution layer in TensorRT on the TX2.
There is also a precision and speed problem. When I use net_->Forward() in Caffe, the performance of the model is good. But when I use TRT (the doInference code above) for FP32 inference, the accuracy drops a lot, and so does the speed.
The Caffe code is shown below:
clock_t time;
time = clock();
net_->Forward();
printf("Predicted in %f seconds.\n", sec(clock() - time));
The resulting times are shown below (about 50 ms):
Predicted in 0.058343 seconds.
_003240 1 0.999893 759 618 859 801
_003240 1 0.999175 666 587 785 806
_003240 1 0.999078 868 469 1168 1101
_003240 1 0.994044 507 597 607 779
_003240 1 0.866441 1032 491 1335 1129
_003240 1 0.825324 631 631 670 705
Predicted in 0.047801 seconds.
_003241 1 0.99984 766 615 870 805
_003241 1 0.999573 668 588 790 807
_003241 1 0.999024 515 600 613 782
_003241 1 0.998149 882 476 1171 1101
_003241 1 0.93445 1040 499 1343 1117
_003241 1 0.267726 634 632 673 706
Predicted in 0.051120 seconds.
_003242 1 0.99997 514 594 619 791
_003242 1 0.999829 771 614 871 805
_003242 1 0.999571 665 569 799 825
_003242 1 0.998126 892 474 1193 1097
_003242 1 0.980864 1032 496 1352 1118
_003242 1 0.517655 639 632 677 704
Predicted in 0.055676 seconds.
_003243 1 0.999964 773 614 875 806
_003243 1 0.999836 516 593 618 792
_003243 1 0.998531 665 568 801 820
_003243 1 0.997485 894 483 1224 1080
_003243 1 0.993176 1077 518 1388 1092
_003243 1 0.988622 852 627 952 821
_003243 1 0.624096 640 632 678 702
Predicted in 0.055876 seconds.
_003244 1 0.999974 785 619 881 812
_003244 1 0.999909 520 591 624 789
_003244 1 0.999094 860 629 957 824
_003244 1 0.998772 667 573 804 817
_003244 1 0.997068 914 474 1236 1102
_003244 1 0.961674 1086 507 1393 1103
_003244 1 0.50607 648 634 687 703
Predicted in 0.055807 seconds.
_003245 1 0.99998 784 620 888 814
_003245 1 0.99992 523 590 628 789
_003245 1 0.999756 866 632 966 827
_003245 1 0.998979 673 570 809 820
_003245 1 0.996583 925 474 1236 1111
_003245 1 0.855668 1110 524 1413 1083
_003245 1 0.579327 654 633 693 707
Predicted in 0.051894 seconds.
_003246 1 0.999946 784 621 890 819
_003246 1 0.999903 530 589 640 796
_003246 1 0.99956 876 633 975 840
_003246 1 0.998763 936 462 1238 1124
_003246 1 0.998218 675 576 812 817
_003246 1 0.92333 1138 506 1429 1105
_003246 1 0.397051 659 634 698 705
The TRT inference path is the doInference() code shown above, and the TRT results are shown below (about 70 ms):
Time taken for inference is 72.3779 ms.
conv0 + conv0/relu 0.573ms
conv1 + conv1/relu 3.041ms
conv2 + conv2/relu 3.843ms
conv3 + conv3/relu input reformatter 0 0.579ms
conv3 + conv3/relu 2.607ms
conv4 + conv4/relu 1.840ms
conv5 + conv5/relu 2.981ms
conv6 + conv6/relu 1.981ms
conv7 + conv7/relu 4.091ms
conv8 + conv8/relu 4.027ms
conv9 + conv9/relu 4.036ms
conv10 + conv10/relu 4.190ms
conv11 + conv11/relu 3.789ms
conv12 + conv12/relu 2.699ms
conv13 + conv13/relu 5.152ms
conv15 + conv15/relu 5.047ms
upsample_1 input reformatter 0 0.068ms
upsample_1 0.835ms
upsample_1 output reformatter 0 0.368ms
conv15_e + conv15_e/relu 0.590ms
conv17 + conv17/relu 3.504ms
conv17 + conv17/relu output reformatter 0.137ms
conv18 + conv18/relu 6.454ms
conv18 + conv18/relu output reformatter 0.176ms
upsample_2 1.372ms
conv18_e + conv18_e/relu 0.782ms
conv18_e + conv18_e/relu output reformat 0.134ms
conv21 + conv21/relu 3.268ms
conv20 + conv20/relu 3.077ms
conv19 + conv19/relu 3.892ms
conv22_1cls 0.088ms
conv22_1cls output reformatter 0 0.005ms
conv23_1cls 0.126ms
conv24_1cls input reformatter 0 0.234ms
conv24_1cls 0.193ms
Time over all layers: 75.782
I used exactly the same Caffe model in Caffe and TRT, but why is there so much difference in performance (speed, accuracy)? Hope you could provide some advice. Thanks so much.
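One detail that may be worth double-checking in the doInference() code above (this is only an assumption; it may already be handled elsewhere before outputData is consumed): the device-to-host copies are issued asynchronously on mTrtCudaStream, so the host buffer is only guaranteed to be filled once the stream has been synchronized. A sketch of the output-copy loop with an explicit synchronize at the end:

for (size_t bindingIdx = mTrtInputCount; bindingIdx < mTrtBindBufferSize.size(); ++bindingIdx)
{
    auto size = mTrtBindBufferSize[bindingIdx];
    CUDA_CHECK(cudaMemcpyAsync(outputData, mTrtCudaBuffer[bindingIdx], size, cudaMemcpyDeviceToHost, mTrtCudaStream));
    outputData = (char*)outputData + size;
}
// Assumption: without this, the caller could read outputData before the asynchronous copies finish.
CUDA_CHECK(cudaStreamSynchronize(mTrtCudaStream));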