Deep Learning Inference Benchmarking Instructions

Hi @Aref, jetson_benchmarks does include ssd-mobilenet-v1, which is pretty similar. Also, I have found that the following patches for sampleUffSSD.cpp and the Makefile get the benchmark running again under JetPack 4.5:

sampleUffSSD.cpp patch

20,21d19
< using namespace sample;
< using namespace std;
23c21
< /*static Logger gLogger;*/
---
> static Logger gLogger;
171c169
<     builder->setMaxWorkspaceSize(1024 * 1024 * 128); // We need about 1GB of scratch space for the plugin layer for batch size 5.
---
>     builder->setMaxWorkspaceSize(128_MB); // We need about 1GB of scratch space for the plugin layer for batch size 5.

Makefile patch

3d2
< EXTRA_DIRECTORIES = ../common
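
For reference, after these changes the affected parts of sampleUffSSD.cpp should read roughly as follows (this is only an excerpt; it matches the full post-patch file quoted further down in this thread, and the elided portion is unchanged):

using namespace nvinfer1;
using namespace nvuffparser;

using namespace sample;   // brings sample::gLogger (declared in ../common/logger.h) into scope
using namespace std;

/*static Logger gLogger;*/   // the local logger is commented out so gLogger resolves to sample::gLogger

...

    builder->setMaxWorkspaceSize(1024 * 1024 * 128); // 128 MB workspace; replaces the 128_MB literal, which is not defined in this sample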

I will update the first post in the topic with this new info - thanks.


Hi,
How do I do step number 2, where I have to patch the file under the SSD-Mobilenet-V2 section?

Hi @arvindh, I recommend manually applying the changes from the patch to the file, or you can use the Linux patch command.

UPDATE AFTER FOLLOWING THE PATCHES
https://forums.developer.nvidia.com/t/deep-learning-inference-benchmarking-instructions/73291/88?u=rajeshroy402
@dusty_nv
Sorry, but it didn’t help. I’m still getting the same errors.
@gshartnett did it resolve for you?

My hardware: Jetson Nano 4GB Dev Kit
JetPack: 4.4.1
L4T 32.4.4
My errors are here:
Current/New errors

../Makefile.config:10: CUDA_INSTALL_DIR variable is not specified, using /usr/local/cuda by default, use CUDA_INSTALL_DIR=<cuda_directory> to change.
../Makefile.config:15: CUDNN_INSTALL_DIR variable is not specified, using /usr/local/cuda by default, use CUDNN_INSTALL_DIR=<cudnn_directory> to change.
../Makefile.config:28: TRT_LIB_DIR is not specified, searching ../../lib, ../../lib, ../lib by default, use TRT_LIB_DIR=<trt_lib_directory> to change.
if [ ! -d ../../bin/chobj/../common ]; then mkdir -p ../../bin/dchobj/../common; fi; :
Compiling: sampleUffSSD.cpp
sampleUffSSD.cpp: In function ‘nvinfer1::ICudaEngine* loadModelAndCreateEngine(const char*, int, nvuffparser::IUffParser*, nvinfer1::IHostMemory*&)’:
sampleUffSSD.cpp:142:44: error: reference to ‘gLogger’ is ambiguous
     IBuilder* builder = createInferBuilder(gLogger);
                                            ^~~~~~~
sampleUffSSD.cpp:23:15: note: candidates are: sample::Logger gLogger
 static Logger gLogger;
               ^~~~~~~
In file included from ../common/common.h:32:0,
                 from BatchStreamPPM.h:9,
                 from sampleUffSSD.cpp:12:
../common/logger.h:24:19: note:                 sample::Logger sample::gLogger
     extern Logger gLogger;
                   ^~~~~~~
sampleUffSSD.cpp:30:9: error: reference to ‘gLogger’ is ambiguous
         gLogger.log(ILogger::Severity::k##severity, error_message.c_str());    \
         ^
sampleUffSSD.cpp:149:9: note: in expansion of macro ‘RETURN_AND_LOG’
         RETURN_AND_LOG(nullptr, INTERNAL_ERROR, "Fail to parse");
         ^~~~~~~~~~~~~~
sampleUffSSD.cpp:23:15: note: candidates are: sample::Logger gLogger
 static Logger gLogger;
               ^~~~~~~
In file included from ../common/common.h:32:0,
                 from BatchStreamPPM.h:9,
                 from sampleUffSSD.cpp:12:
../common/logger.h:24:19: note:                 sample::Logger sample::gLogger
     extern Logger gLogger;
                   ^~~~~~~
sampleUffSSD.cpp:171:34: error: unable to find numeric literal operator ‘operator""_MB’
     builder->setMaxWorkspaceSize(128_MB); // We need about 1GB of scratch space for the plugin layer for batch size 5.
                                  ^~~~~~
sampleUffSSD.cpp:171:34: note: use -std=gnu++11 or -fext-numeric-literals to enable more built-in suffixes
sampleUffSSD.cpp:30:9: error: reference to ‘gLogger’ is ambiguous
         gLogger.log(ILogger::Severity::k##severity, error_message.c_str());    \
         ^
sampleUffSSD.cpp:186:9: note: in expansion of macro ‘RETURN_AND_LOG’
         RETURN_AND_LOG(nullptr, INTERNAL_ERROR, "Unable to create engine");
         ^~~~~~~~~~~~~~
sampleUffSSD.cpp:23:15: note: candidates are: sample::Logger gLogger
 static Logger gLogger;
               ^~~~~~~
In file included from ../common/common.h:32:0,
                 from BatchStreamPPM.h:9,
                 from sampleUffSSD.cpp:12:
../common/logger.h:24:19: note:                 sample::Logger sample::gLogger
     extern Logger gLogger;
                   ^~~~~~~
sampleUffSSD.cpp: In function ‘int main(int, char**)’:
sampleUffSSD.cpp:561:28: error: reference to ‘gLogger’ is ambiguous
     initLibNvInferPlugins(&gLogger, "");
                            ^~~~~~~
sampleUffSSD.cpp:23:15: note: candidates are: sample::Logger gLogger
 static Logger gLogger;
               ^~~~~~~
In file included from ../common/common.h:32:0,
                 from BatchStreamPPM.h:9,
                 from sampleUffSSD.cpp:12:
../common/logger.h:24:19: note:                 sample::Logger sample::gLogger
     extern Logger gLogger;
                   ^~~~~~~
sampleUffSSD.cpp:630:44: error: reference to ‘gLogger’ is ambiguous
     IRuntime* runtime = createInferRuntime(gLogger);
                                            ^~~~~~~
sampleUffSSD.cpp:23:15: note: candidates are: sample::Logger gLogger
 static Logger gLogger;
               ^~~~~~~
In file included from ../common/common.h:32:0,
                 from BatchStreamPPM.h:9,
                 from sampleUffSSD.cpp:12:
../common/logger.h:24:19: note:                 sample::Logger sample::gLogger
     extern Logger gLogger;
                   ^~~~~~~
../Makefile.config:312: recipe for target '../../bin/dchobj/sampleUffSSD.o' failed
make: *** [../../bin/dchobj/sampleUffSSD.o] Error 1

Hi @rajeshroy402, please see the patches from the first post of this topic, under the JetPack 4.4 or newer drop-down.
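
Looking at your build log, the compiler notes at sampleUffSSD.cpp:23 and sampleUffSSD.cpp:171 show that those lines still contain the pre-patch code (static Logger gLogger; and 128_MB), so it appears at least part of the sampleUffSSD.cpp patch did not get applied to your copy of the file. After patching, those two lines should read roughly like this (explicitly qualifying the logger as sample::gLogger everywhere it is used would also resolve the ambiguity, but commenting it out as in the patch is simpler):

/*static Logger gLogger;*/
    builder->setMaxWorkspaceSize(1024 * 1024 * 128); // replaces 128_MB, which is not defined in this file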

Yes, I have followed the patch instructions. It’s still not working.

OK, I just tried this again on JetPack 4.4.1 and applied the patch, and it is working here. This is the source of /usr/src/tensorrt/samples/sampleUffSSD_rect/sampleUffSSD.cpp after the patch is applied:

#include <cassert>
#include <chrono>
#include <cublas_v2.h>
#include <cudnn.h>
#include <iostream>
#include <sstream>
#include <string.h>
#include <time.h>
#include <unordered_map>
#include <vector>

#include "BatchStreamPPM.h"
#include "NvUffParser.h"
#include "common.h"
#include "argsParser.h"
#include "NvInferPlugin.h"

using namespace nvinfer1;
using namespace nvuffparser;

using namespace sample;
using namespace std;

/*static Logger gLogger;*/

static samplesCommon::Args args;

#define RETURN_AND_LOG(ret, severity, message)                                 \
    do                                                                         \
    {                                                                          \
        std::string error_message = "sample_uff_ssd: " + std::string(message); \
        gLogger.log(ILogger::Severity::k##severity, error_message.c_str());    \
        return (ret);                                                          \
    } while (0)

static constexpr int OUTPUT_CLS_SIZE = 37;
static constexpr int OUTPUT_BBOX_SIZE = OUTPUT_CLS_SIZE * 4;

const char* OUTPUT_BLOB_NAME0 = "NMS";
const char* CLASSIFIER_CLS_SOFTMAX_NAME = "random_name";
//INT8 Calibration, currently set to calibrate over 500 images
static constexpr int CAL_BATCH_SIZE = 50;
static constexpr int FIRST_CAL_BATCH = 0, NB_CAL_BATCHES = 10;

// Concat layers
// mbox_priorbox, mbox_loc, mbox_conf
const int concatAxis[2] = {1, 1};
const bool ignoreBatch[2] = {false, false};

DetectionOutputParameters detectionOutputParam{true, false, 0, OUTPUT_CLS_SIZE, 100, 100, 0.5, 0.6, CodeTypeSSD::TF_CENTER, {1, 2, 0}, true, true};

// Visualization
const float visualizeThreshold = 0.4;

void printOutput(int64_t eltCount, DataType dtype, void* buffer)
{
    std::cout << eltCount << " eltCount" << std::endl;
    assert(samplesCommon::getElementSize(dtype) == sizeof(float));
    std::cout << "--- OUTPUT ---" << std::endl;

    size_t memSize = eltCount * samplesCommon::getElementSize(dtype);
    float* outputs = new float[eltCount];
    CHECK(cudaMemcpyAsync(outputs, buffer, memSize, cudaMemcpyDeviceToHost));

    int maxIdx = std::distance(outputs, std::max_element(outputs, outputs + eltCount));

    for (int64_t eltIdx = 0; eltIdx < eltCount; ++eltIdx)
    {
        std::cout << eltIdx << " => " << outputs[eltIdx] << "\t : ";
        if (eltIdx == maxIdx)
            std::cout << "***";
        std::cout << "\n";
    }

    std::cout << std::endl;
    delete[] outputs;
}

std::string locateFile(const std::string& input)
{
    std::vector<std::string> dirs{"data/ssd/",
                                  "data/ssd/VOC2007/",
                                  "data/ssd/VOC2007/PPMImages/",
                                  "data/samples/ssd/",
                                  "data/samples/ssd/VOC2007/",
                                  "data/samples/ssd/VOC2007/PPMImages/"};
    return locateFile(input, dirs);
}

void populateTFInputData(float* data)
{

    auto fileName = locateFile("inp_bus.txt");
    std::ifstream labelFile(fileName);
    string line;
    int id = 0;
    while (getline(labelFile, line))
    {
        istringstream iss(line);
        float num;
        iss >> num;
        data[id++] = num;
    }

    return;
}

void populateClassLabels(std::string (&CLASSES)[OUTPUT_CLS_SIZE])
{

    // auto fileName = locateFile("ssd_coco_labels.txt");
    auto fileName = locateFile("sample_labels.txt");
    std::ifstream labelFile(fileName);
    string line;
    int id = 0;
    while (getline(labelFile, line))
    {
        CLASSES[id++] = line;
    }

    return;
}

std::vector<std::pair<int64_t, DataType>>
calculateBindingBufferSizes(const ICudaEngine& engine, int nbBindings, int batchSize)
{
    std::vector<std::pair<int64_t, DataType>> sizes;
    for (int i = 0; i < nbBindings; ++i)
    {
        Dims dims = engine.getBindingDimensions(i);
        DataType dtype = engine.getBindingDataType(i);

        int64_t eltCount = samplesCommon::volume(dims) * batchSize;
        sizes.push_back(std::make_pair(eltCount, dtype));
    }

    return sizes;
}

ICudaEngine* loadModelAndCreateEngine(const char* uffFile, int maxBatchSize,
                                      IUffParser* parser, IHostMemory*& trtModelStream)
{
    // Create the builder
    IBuilder* builder = createInferBuilder(gLogger);

    // Parse the UFF model to populate the network, then set the outputs.
    INetworkDefinition* network = builder->createNetwork();

    std::cout << "Begin parsing model..." << std::endl;
    if (!parser->parse(uffFile, *network, nvinfer1::DataType::kFLOAT))
        RETURN_AND_LOG(nullptr, INTERNAL_ERROR, "Fail to parse");

    std::cout << "End parsing model..." << std::endl;
    // for (int i = 0; i < network->getNbLayers(); ++i) {
    //     if (! strcmp(network->getLayer(i)->getName(), OUTPUT_BLOB_NAME0)) {
    //         auto top_layer = network->getLayer(i);
    //         cout << "unmark " << OUTPUT_BLOB_NAME0 << endl;
    //         network->unmarkOutput(*top_layer->getOutput(0));
    //     }

    //     if (! strcmp(network->getLayer(i)->getName(), BACKBONE_OUTPUT)) {
    //         auto top_layer = network->getLayer(i);
    //         // auto prob = network->addSoftMax(*top_layer->getOutput(0));
    //         // prob->getOutput(0)->setName(CLASSIFIER_CLS_SOFTMAX_NAME);
    //         // network->markOutput(*prob->getOutput(0));
    //         network->markOutput(*top_layer->getOutput(0));
    //         cout << "mark " << BACKBONE_OUTPUT << endl;
    //     }
    // }
    // Build the engine.
    builder->setMaxBatchSize(maxBatchSize);
    // The _GB literal operator is defined in common/common.h
    builder->setMaxWorkspaceSize(1024*1024*128); // We need about 1GB of scratch space for the plugin layer for batch size 5.
    builder->setHalf2Mode(true);
    //if (args.runInInt8)
    //{
    //    builder->setInt8Mode(true);
    //    builder->setInt8Calibrator(calibrator);
    //}

    std::cout << "Begin building engine..." << std::endl;
    auto t_start = std::chrono::high_resolution_clock::now();
    ICudaEngine* engine = builder->buildCudaEngine(*network);
    auto t_end = std::chrono::high_resolution_clock::now();
    float total = std::chrono::duration<float, std::milli>(t_end - t_start).count();
    std::cout << "Time lapsed to create an engine: " << total << "ms" << std::endl;
    if (!engine)
        RETURN_AND_LOG(nullptr, INTERNAL_ERROR, "Unable to create engine");
    std::cout << "End building engine..." << std::endl;

    // We don't need the network any more, and we can destroy the parser.
    network->destroy();
    parser->destroy();

    // Serialize the engine, then close everything down.
    trtModelStream = engine->serialize();

    builder->destroy();
    shutdownProtobufLibrary();
    return engine;
}

void doInference(IExecutionContext& context, float* inputData, float* detectionOut, int* keepCount, int batchSize)
{
    const ICudaEngine& engine = context.getEngine();
    // Input and output buffer pointers that we pass to the engine - the engine requires exactly IEngine::getNbBindings(),
    // of these, but in this case we know that there is exactly 1 input and 2 output.
    int nbBindings = engine.getNbBindings();
    std::cout << nbBindings << " Binding" << std::endl;
    std::vector<void*> buffers(nbBindings);
    std::vector<std::pair<int64_t, DataType>> buffersSizes = calculateBindingBufferSizes(engine, nbBindings, batchSize);

    for (int i = 0; i < nbBindings; ++i)
    {
        auto bufferSizesOutput = buffersSizes[i];
        buffers[i] = samplesCommon::safeCudaMalloc(bufferSizesOutput.first * samplesCommon::getElementSize(bufferSizesOutput.second));
        std::cout << "Allocating buffer sizes for binding index: " << i << " of size : " ;
        std::cout << bufferSizesOutput.first << " * " << samplesCommon::getElementSize(bufferSizesOutput.second);
        std::cout << " B" << std::endl;
    }

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings().
    int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME),
        outputIndex0 = engine.getBindingIndex(OUTPUT_BLOB_NAME0),
        outputIndex1 = outputIndex0 + 1; //engine.getBindingIndex(OUTPUT_BLOB_NAME1);

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA the input to the GPU,  execute the batch asynchronously, and DMA it back:
    CHECK(cudaMemcpyAsync(buffers[inputIndex], inputData, batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));

    float t_average = 0;
    float t_iter = 0;
    int iterations = 10;
    int avgRuns = 100;

    for (int i = 0; i < iterations; i++)
    {
        t_average = 0;
        for (int j = 0; j < avgRuns; j++)
        {
            auto t_start = std::chrono::high_resolution_clock::now();
            context.execute(batchSize, &buffers[0]);
            auto t_end = std::chrono::high_resolution_clock::now();
            float total = std::chrono::duration<float, std::milli>(t_end - t_start).count();
            t_average += total;
        }
        t_average /= avgRuns;
        std::cout << "Time taken for inference per run is " << t_average << " ms." << std::endl;
        t_iter += t_average;
    }
    t_iter /= iterations;
    std::cout << "Average time spent per iteration is " << t_iter << " ms." << std::endl;

    for (int bindingIdx = 0; bindingIdx < nbBindings; ++bindingIdx)
    {
        if (engine.bindingIsInput(bindingIdx))
            continue;
#ifdef SSD_INT8_DEBUG
        auto bufferSizesOutput = buffersSizes[bindingIdx];
        printOutput(bufferSizesOutput.first, bufferSizesOutput.second,
                    buffers[bindingIdx]);
#endif
    }

    CHECK(cudaMemcpyAsync(detectionOut, buffers[outputIndex0], batchSize * detectionOutputParam.keepTopK * 7 * sizeof(float), cudaMemcpyDeviceToHost, stream));
    CHECK(cudaMemcpyAsync(keepCount, buffers[outputIndex1], batchSize * sizeof(int), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);
    std::cout << "Time taken for inference is " << t_average << " ms." << std::endl;
    // Release the stream and the buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex0]));
    CHECK(cudaFree(buffers[outputIndex1]));
}

//////////////////////////////////////////////////////////////////////////////////////////////////////
namespace
{
const char* FLATTENCONCAT_PLUGIN_VERSION{"1"};
const char* FLATTENCONCAT_PLUGIN_NAME{"FlattenConcat_TRT"};
}

// Flattens all input tensors and concats their flattened version together
// along the major non-batch dimension, i.e axis = 1
class FlattenConcat : public IPluginV2
{
public:
    // Ordinary ctor, plugin not yet configured for particular inputs/output
    FlattenConcat() {}

    // Ctor for clone()
    FlattenConcat(const int* flattenedInputSize, int numInputs, int flattenedOutputSize)
        : mFlattenedOutputSize(flattenedOutputSize)
    {
        for (int i = 0; i < numInputs; ++i)
            mFlattenedInputSize.push_back(flattenedInputSize[i]);
    }

    // Ctor for loading from serialized byte array
    FlattenConcat(const void* data, size_t length)
    {
        const char* d = reinterpret_cast<const char*>(data);
        const char* a = d;

        size_t numInputs = read<size_t>(d);
        for (size_t i = 0; i < numInputs; ++i)
        {
            mFlattenedInputSize.push_back(read<int>(d));
        }
        mFlattenedOutputSize = read<int>(d);

        assert(d == a + length);
    }

    int getNbOutputs() const override
    {
        // We always return one output
        return 1;
    }

    Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override
    {
        // At least one input
        assert(nbInputDims >= 1);
        // We only have one output, so it doesn't
        // make sense to check index != 0
        assert(index == 0);

        size_t flattenedOutputSize = 0;
        int inputVolume = 0;

        for (int i = 0; i < nbInputDims; ++i)
        {
            // We only support NCHW. And inputs Dims are without batch num.
            assert(inputs[i].nbDims == 3);

            inputVolume = inputs[i].d[0] * inputs[i].d[1] * inputs[i].d[2];
            flattenedOutputSize += inputVolume;
        }

        return DimsCHW(flattenedOutputSize, 1, 1);
    }

    int initialize() override
    {
        // Called on engine initialization, we initialize cuBLAS library here,
        // since we'll be using it for inference
        CHECK(cublasCreate(&mCublas));
        return 0;
    }

    void terminate() override
    {
        // Called on engine destruction, we destroy cuBLAS data structures,
        // which were created in initialize()
        CHECK(cublasDestroy(mCublas));
    }

    size_t getWorkspaceSize(int maxBatchSize) const override
    {
        // The operation is done in place, it doesn't use GPU memory
        return 0;
    }

    int enqueue(int batchSize, const void* const* inputs, void** outputs, void*, cudaStream_t stream) override
    {
        // Does the actual concat of inputs, which is just
        // copying all inputs bytes to output byte array
        size_t inputOffset = 0;
        float* output = reinterpret_cast<float*>(outputs[0]);

        for (size_t i = 0; i < mFlattenedInputSize.size(); ++i)
        {
            const float* input = reinterpret_cast<const float*>(inputs[i]);
            for (int batchIdx = 0; batchIdx < batchSize; ++batchIdx)
            {
                CHECK(cublasScopy(mCublas, mFlattenedInputSize[i],
                                  input + batchIdx * mFlattenedInputSize[i], 1,
                                  output + (batchIdx * mFlattenedOutputSize + inputOffset), 1));
            }
            inputOffset += mFlattenedInputSize[i];
        }

        return 0;
    }

    size_t getSerializationSize() const override
    {
        // Returns FlattenConcat plugin serialization size
        size_t size = sizeof(mFlattenedInputSize[0]) * mFlattenedInputSize.size()
            + sizeof(mFlattenedOutputSize)
            + sizeof(size_t); // For serializing mFlattenedInputSize vector size
        return size;
    }

    void serialize(void* buffer) const override
    {
        // Serializes FlattenConcat plugin into byte array

        // Cast buffer to char* and save its beginning to a,
        // (since value of d will be changed during write)
        char* d = reinterpret_cast<char*>(buffer);
        char* a = d;

        size_t numInputs = mFlattenedInputSize.size();

        // Write FlattenConcat fields into buffer
        write(d, numInputs);
        for (size_t i = 0; i < numInputs; ++i)
        {
            write(d, mFlattenedInputSize[i]);
        }
        write(d, mFlattenedOutputSize);

        // Sanity check - checks if d is offset
        // from a by exactly the size of serialized plugin
        assert(d == a + getSerializationSize());
    }

    void configureWithFormat(const Dims* inputs, int nbInputs, const Dims* outputDims, int nbOutputs, nvinfer1::DataType type, nvinfer1::PluginFormat format, int maxBatchSize) override
    {
        // We only support one output
        assert(nbOutputs == 1);

        // Reset plugin private data structures
        mFlattenedInputSize.clear();
        mFlattenedOutputSize = 0;

        // For each input we save its size, we also validate it
        for (int i = 0; i < nbInputs; ++i)
        {
            int inputVolume = 0;

            // We only support NCHW. And inputs Dims are without batch num.
            assert(inputs[i].nbDims == 3);

            // All inputs dimensions along non concat axis should be same
            for (size_t dim = 1; dim < 3; dim++)
            {
                assert(inputs[i].d[dim] == inputs[0].d[dim]);
            }

            // Size of flattened input
            inputVolume = inputs[i].d[0] * inputs[i].d[1] * inputs[i].d[2];
            mFlattenedInputSize.push_back(inputVolume);
            mFlattenedOutputSize += mFlattenedInputSize[i];
        }
    }

    bool supportsFormat(DataType type, PluginFormat format) const override
    {
        return (type == DataType::kFLOAT && format == PluginFormat::kNCHW);
    }

    const char* getPluginType() const override { return FLATTENCONCAT_PLUGIN_NAME; }

    const char* getPluginVersion() const override { return FLATTENCONCAT_PLUGIN_VERSION; }

    void destroy() override {}

    IPluginV2* clone() const override
    {
        return new FlattenConcat(mFlattenedInputSize.data(), mFlattenedInputSize.size(), mFlattenedOutputSize);
    }

    void setPluginNamespace(const char* pluginNamespace) override
    {
        mPluginNamespace = pluginNamespace;
    }

    const char* getPluginNamespace() const override
    {
        return mPluginNamespace.c_str();
    }

private:
    template <typename T>
    void write(char*& buffer, const T& val) const
    {
        *reinterpret_cast<T*>(buffer) = val;
        buffer += sizeof(T);
    }

    template <typename T>
    T read(const char*& buffer)
    {
        T val = *reinterpret_cast<const T*>(buffer);
        buffer += sizeof(T);
        return val;
    }

    // Number of elements in each plugin input, flattened
    std::vector<int> mFlattenedInputSize;
    // Number of elements in output, flattened
    int mFlattenedOutputSize{0};
    // cuBLAS library handle
    cublasHandle_t mCublas;
    // We're not using TensorRT namespaces in
    // this sample, so it's just an empty string
    std::string mPluginNamespace = "";
};

// PluginCreator boilerplate code for FlattenConcat plugin
class FlattenConcatPluginCreator : public IPluginCreator
{
public:
    FlattenConcatPluginCreator()
    {
        mFC.nbFields = 0;
        mFC.fields = 0;
    }

    ~FlattenConcatPluginCreator() {}

    const char* getPluginName() const override { return FLATTENCONCAT_PLUGIN_NAME; }

    const char* getPluginVersion() const override { return FLATTENCONCAT_PLUGIN_VERSION; }

    const PluginFieldCollection* getFieldNames() override { return &mFC; }

    IPluginV2* createPlugin(const char* name, const PluginFieldCollection* fc) override
    {
        return new FlattenConcat();
    }

    IPluginV2* deserializePlugin(const char* name, const void* serialData, size_t serialLength) override
    {

        return new FlattenConcat(serialData, serialLength);
    }

    void setPluginNamespace(const char* pluginNamespace) override
    {
        mPluginNamespace = pluginNamespace;
    }

    const char* getPluginNamespace() const override
    {
        return mPluginNamespace.c_str();
    }

private:
    static PluginFieldCollection mFC;
    static std::vector<PluginField> mPluginAttributes;
    std::string mPluginNamespace = "";
};

PluginFieldCollection FlattenConcatPluginCreator::mFC{};
std::vector<PluginField> FlattenConcatPluginCreator::mPluginAttributes;

REGISTER_TENSORRT_PLUGIN(FlattenConcatPluginCreator);


int main(int argc, char* argv[])
{
    // Parse command-line arguments.
    samplesCommon::parseArgs(args, argc, argv);

    initLibNvInferPlugins(&gLogger, "");
    auto fileName = locateFile("sample_unpruned_mobilenet_v2.uff");
    std::cout << fileName << std::endl;

    // changed BATCH SIZE back to 2
    const int N = 1;
    //const int N = 2;
    
    auto parser = createUffParser();

    // BatchStream calibrationStream(CAL_BATCH_SIZE, NB_CAL_BATCHES);

    std::cout << "Registering UFF model" << std::endl;
    parser->registerInput("Input", DimsCHW(INPUT_C, INPUT_H, INPUT_W), UffInputOrder::kNCHW);
    std::cout << "Registered Input" << std::endl;
    // parser->registerOutput("ssd_resnet18keras_feature_extractor/model/activation_16/Relu");
    parser->registerOutput("NMS");
    std::cout << "Registered output NMS" << std::endl;

    IHostMemory* trtModelStream{nullptr};

    // Int8EntropyCalibrator calibrator(calibrationStream, FIRST_CAL_BATCH, "CalibrationTableSSD");

    std::cout << "Creating engine" << std::endl;
    ICudaEngine* tmpEngine = loadModelAndCreateEngine(fileName.c_str(), N, parser, trtModelStream);
    assert(tmpEngine != nullptr);
    assert(trtModelStream != nullptr);
    tmpEngine->destroy();
    std::cout << "Created engine" << std::endl;
    // Read a random sample image.
    srand(unsigned(time(nullptr)));
    // Available images.
    std::vector<std::string> imageList = {"dog.ppm", "image1.ppm"};
    std::vector<samplesCommon::PPM<INPUT_C, INPUT_H, INPUT_W>> ppms(N);

    assert(ppms.size() <= imageList.size());
    std::cout << " Num batches  " << N << std::endl;
    for (int i = 0; i < N; ++i)
    {
        readPPMFile(locateFile(imageList[i%2]), ppms[i]);
    }

    vector<float> data(N * INPUT_C * INPUT_H * INPUT_W);

    // for (int i = 0, volImg = INPUT_C * INPUT_H * INPUT_W; i < N; ++i)
    // {
    //     for (int c = 0; c < INPUT_C; ++c)
    //     {
    //         for (unsigned j = 0, volChl = INPUT_H * INPUT_W; j < volChl; ++j)
    //         {
    //             data[i * volImg + c * volChl + j] = (2.0 / 255.0) * float(ppms[i].buffer[j * INPUT_C + c]) - 1.0;
    //         }
    //     }
    // }

    // RGB preprocessing based on keras pipeline
    for (int i = 0, volImg = INPUT_C * INPUT_H * INPUT_W; i < N; ++i)
    {
      for (unsigned j = 0, volChl = INPUT_H * INPUT_W; j < volChl; ++j)
      {
        data[i * volImg + 0 * volChl + j] = float(ppms[i].buffer[j * INPUT_C + 0]) - 123.68;
        data[i * volImg + 1 * volChl + j] = float(ppms[i].buffer[j * INPUT_C + 1]) - 116.779;
        data[i * volImg + 2 * volChl + j] = float(ppms[i].buffer[j * INPUT_C + 2]) - 103.939;
      }
    }
    std::cout << " Data Size  " << data.size() << std::endl;

    // Deserialize the engine.
    std::cout << "*** deserializing" << std::endl;
    IRuntime* runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream->data(), trtModelStream->size(), nullptr);
    assert(engine != nullptr);
    trtModelStream->destroy();
    IExecutionContext* context = engine->createExecutionContext();
    assert(context != nullptr);
    //SimpleProfiler profiler("layerTime");
    //context->setProfiler(&profiler);
    // Host memory for outputs.
    vector<float> detectionOut(100000); //(N * detectionOutputParam.keepTopK * 7);
    vector<int> keepCount(N);

    // Run inference.
    doInference(*context, &data[0], &detectionOut[0], &keepCount[0], N);
    cout << " KeepCount " << keepCount[0] << "\n";
    //cout << profiler;
    /***********************************************************************************
    std::string CLASSES[OUTPUT_CLS_SIZE];

    populateClassLabels(CLASSES);

    for (int p = 0; p < N; ++p)
    {
        for (int i = 0; i < keepCount[p]; ++i)
        {
            float* det = &detectionOut[0] + (p * detectionOutputParam.keepTopK + i) * 7;
            if (det[2] < visualizeThreshold)
                continue;

            // Output format for each detection is stored in the below order
            // [image_id, label, confidence, xmin, ymin, xmax, ymax]
            assert((int) det[1] < OUTPUT_CLS_SIZE);
            std::string storeName = CLASSES[(int) det[1]] + "-" + std::to_string(det[2]) + ".ppm";

            printf("Detected %s in the image %d (%s) with confidence %f%% and coordinates (%f,%f),(%f,%f).\nResult stored in %s.\n", CLASSES[(int) det[1]].c_str(), int(det[0]), ppms[p].fileName.c_str(), det[2] * 100.f, det[3] * INPUT_W, det[4] * INPUT_H, det[5] * INPUT_W, det[6] * INPUT_H, storeName.c_str());

            samplesCommon::writePPMFileWithBBox(storeName, ppms[p], {det[3] * INPUT_W, det[4] * INPUT_H, det[5] * INPUT_W, det[6] * INPUT_H});
        }
    }
    ************************************************************************************/
    // Destroy the engine.
    context->destroy();
    engine->destroy();
    runtime->destroy();

    return EXIT_SUCCESS;
}

Thank you.

Hello,

Just a quick question: is the pose estimation model the same as the one in jetson-inference/posenet.md at master · dusty-nv/jetson-inference · GitHub?

Thank you.

@user156593 those numbers are using the OpenPose model. I believe these benchmarks are from before there was poseNet in jetson-inference (which uses the models from trt_pose).

@dusty_nv Thank you for your answer! So if I understand correctly, the easiest one to use is poseNet, because it is included in the jetson-inference library. Do you know whether any newer pose estimation model is optimized for the Jetson Nano? We are currently using trt_pose (within Docker, using the jetson-inference base image), but the accuracy is not really satisfactory, especially when estimation is done on darker images. This is actually the reason I was asking about OpenPose, since I saw that it performs well. Maybe I should open a new thread on the forum asking about this.

@user156593 you might be interested in the BodyPoseNet model that you could use with TAO / DeepStream.

I was also able to find some OpenPose + TensorRT projects on GitHub / Google that you could try.

Hello,

This BodyPoseNet looks decent; however, judging from the benchmarking results it only provides about 5 FPS on the Jetson Nano. If possible, I would like to achieve approx. 10 FPS (I know I am being picky now, sorry for that). If you have found any other pose estimation models that can be deployed using TensorRT on the Jetson Nano, it would be much appreciated!

I believe that MediaPipe and open-mmlab also have pose estimation models, although I’m unsure what performance they have or whether they use TensorRT. I’ve seen some folks here on the forums install MediaPipe on their Jetson, but I don’t know what all that entails these days. The models that I know use TensorRT are the ones I listed above, although you may be able to get others working with it.

OK, I will look into that! Hopefully I will find something promising. Thank you anyway!