DLA, Faster R-CNN model error

Hi, I have installed TensorRT successfully.

Ubuntu 18.04 LTS aarch64
CUDA 10.0.326
cuDNN 7.5.0.66
TensorRT 5.1.6.1
Xavier

I am running my Caffe Faster R-CNN model on DLA with TensorRT 5.1. I have enabled GPU fallback and FP16 mode, but the following error occurs. How do I fix it?

[W] [TRT] Default DLA is enabled but layer proposal is not running on DLA, falling back to GPU.
...
[W] [TRT] Default DLA is enabled but layer cls_prob is not running on DLA, falling back to GPU.
[W] [TRT] DLA Node compilation Failed.
[E] [TRT] Internal error: could not find any implementation for node{conv1, relu1, conv2, relu2, ......, concat, convf, reluf, rpn_conv1}, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
[E] [TRT] ../builder/tacticOptimizer.cpp(1330) - OutOfMemory Error in computeCosts:0

...Assertion 'engine' failed.
Aborted(core dumped)

I have set batch size = 1 and max workspace size = 1000 << 21, and my model input is 160x160, but I still get the same error.
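
For what it's worth, 1000 << 21 works out to about 2 GB, so the workspace does not look small to me. Writing it as an explicit power of two makes the size easier to verify (the exact value below is just an example):

    // 1000 << 21 == 2,097,152,000 bytes (~2 GB); an explicit count is clearer.
    builder->setMaxWorkspaceSize(1ULL << 31); // ~2 GiB, roughly the same order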

With TensorRT 5.0, this code produced the same error as the one described at the URL below.

../builder/cudnnBuilder2.cpp (728) - Misc Error in buildSingleLayer: 1 (Unable to process layer.)
../builder/cudnnBuilder2.cpp (728) - Misc Error in buildSingleLayer: 1 (Unable to process layer.)

https://devtalk.nvidia.com/default/topic/1045093/jetson-agx-xavier/running-tensorflow-models-on-dla/post/5364451/#5364451

That thread says the problem would be fixed in TensorRT 5.1, but the error still occurs as described above. What causes it, and how do I fix it? Thanks.

My code:

const int useFP = 16;
static int gUseDLACore{0};

// Excerpted from my app; uses NvInfer.h, NvCaffeParser.h, and the samples'
// common.h (for gLogger and samplesCommon::enableDLA).
// Parses a Caffe model and serializes a TensorRT engine into gieModelStream.
void caffeToGIEModel(const std::string& deployFile,
                     const std::string& modelFile,
                     const std::vector<std::string>& outputs,
                     unsigned int maxBatchSize,
                     nvcaffeparser1::IPluginFactory* pluginFactory,
                     IHostMemory** gieModelStream)
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    ICaffeParser* parser = createCaffeParser();
    parser->setPluginFactory(pluginFactory);

    // Parse in FP16 only when the platform supports it; otherwise bail out.
    DataType dataType = DataType::kFLOAT; // default
    if (useFP == 16)
    {
        if (builder->platformHasFastFp16())
        {
            dataType = DataType::kHALF;
        }
        else
        {
            return;
        }
    }

    std::cout << "Begin parsing model..." << std::endl;
    const IBlobNameToTensor* blobNameToTensor = parser->parse(deployFile.c_str(),
                                                              modelFile.c_str(),
                                                              *network,
                                                              dataType);
    std::cout << "End parsing model..." << std::endl;

    for (auto& s : outputs)
        network->markOutput(*blobNameToTensor->find(s.c_str()));

    builder->setMaxBatchSize(maxBatchSize);
    builder->setMaxWorkspaceSize(1000 << 21); // ~2 GB
    if (gUseDLACore >= 0)
    {
        // From the TensorRT samples: enables DLA with GPU fallback.
        samplesCommon::enableDLA(builder, gUseDLACore);
    }
    if (useFP == 16)
    {
        builder->setFp16Mode(true);
    }

    std::cout << "Begin building engine..." << std::endl;
    ICudaEngine* engine = builder->buildCudaEngine(*network);
    assert(engine);
    std::cout << "End building engine..." << std::endl;

    // The network and parser are no longer needed once the engine is built.
    network->destroy();
    parser->destroy();

    (*gieModelStream) = engine->serialize();

    engine->destroy();
    builder->destroy();
    shutdownProtobufLibrary();
}
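
For completeness, I call it roughly like this; the file names and output blob names below are illustrative, not my exact ones:

    PluginFactory pluginFactory;
    IHostMemory* gieModelStream{nullptr};
    caffeToGIEModel("faster_rcnn_test.prototxt", "faster_rcnn.caffemodel",
                    std::vector<std::string>{"bbox_pred", "cls_prob", "rois"},
                    1, &pluginFactory, &gieModelStream);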

Hi,

We would like to reproduce this issue in our environment.
Could you also share your model with us?

Thanks.

Hi,

Thank you for your reply.

Could you please tell me an email address where I can send the prototxt file?

Thanks.

Although I am now using TensorRT 5, I implement plugins the way the TensorRT 4 samples did, like this:

class PluginFactory : public nvinfer1::IPluginFactory, public nvcaffeparser1::IPluginFactory
{
    ...
};

class Reshape : public IPlugin
{
    ...
};

Could this be the cause of my errors?
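
For context, the TensorRT 4-style pattern I follow looks roughly like this; the method bodies below are a minimal sketch, not my exact implementation:

    #include <cuda_runtime_api.h>
    #include "NvInfer.h"

    // Minimal sketch of a TensorRT 4/5-era IPlugin; bodies are illustrative.
    class Reshape : public nvinfer1::IPlugin
    {
    public:
        Reshape() = default;
        Reshape(const void*, size_t) {} // deserialization constructor (unused here)

        int getNbOutputs() const override { return 1; }

        nvinfer1::Dims getOutputDimensions(int, const nvinfer1::Dims* inputs, int) override
        {
            return inputs[0]; // a real reshape would return the reshaped dims
        }

        void configure(const nvinfer1::Dims* inputDims, int, const nvinfer1::Dims*,
                       int, int) override
        {
            // Cache the per-batch element count for enqueue().
            mCopySize = 1;
            for (int i = 0; i < inputDims[0].nbDims; ++i)
                mCopySize *= inputDims[0].d[i];
        }

        int initialize() override { return 0; }
        void terminate() override {}
        size_t getWorkspaceSize(int) const override { return 0; }

        int enqueue(int batchSize, const void* const* inputs, void** outputs,
                    void*, cudaStream_t stream) override
        {
            // Plain IPlugin I/O stays FP32 even in an FP16 engine, hence sizeof(float).
            cudaMemcpyAsync(outputs[0], inputs[0],
                            mCopySize * batchSize * sizeof(float),
                            cudaMemcpyDeviceToDevice, stream);
            return 0;
        }

        size_t getSerializationSize() override { return 0; }
        void serialize(void*) override {}

    private:
        size_t mCopySize{0};
    };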

Since I use FP16, for the plugin created here,

mPluginRPROI = std::unique_ptr<INvPlugin, decltype(nvPluginDeleter)>
                           (createFasterRCNNPlugin(featureStride, preNmsTop, nmsMaxOut, iouThreshold, minBoxSize, spatialScale,
                                                   DimsHW(poolingH, poolingW), Weights{ dataType, anchorsRatios, anchorsRatioCount },
                                                   Weights{ dataType, anchorsScales, anchorsScaleCount }), nvPluginDeleter);

the dataType should be kHALF, not kFLOAT, right?

And for data that is not weights, as in the copy below, the size should be sizeof(float), not sizeof(float)/2, right?

int enqueue(int batchSize, const void* const* inputs, void** outputs, void*, cudaStream_t stream) override
{
    CHECK(cudaMemcpyAsync(outputs[0], inputs[0], mCopySize * batchSize * sizeof(float), cudaMemcpyDeviceToDevice, stream));
    return 0;
}
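
To be concrete about the sizes I mean, here is a small helper sketch; elementSize is my own hypothetical name, not a TensorRT API:

    // Bytes per element for each TensorRT data type (hypothetical helper).
    size_t elementSize(nvinfer1::DataType t)
    {
        switch (t)
        {
        case nvinfer1::DataType::kFLOAT: return 4;
        case nvinfer1::DataType::kHALF:  return 2;
        case nvinfer1::DataType::kINT8:  return 1;
        case nvinfer1::DataType::kINT32: return 4;
        }
        return 4;
    }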

I ran another model that is very similar to this one, and it works. The log is as follows.

[W] [TRT] Default DLA is enabled but layer rpn_cls_score_reshape is not running on DLA, falling back to GPU.
[W] [TRT] Default DLA is enabled but layer rpn_cls_prob is not running on DLA, falling back to GPU.
[W] [TRT] Default DLA is enabled but layer rpn_cls_prob_reshape is not running on DLA, falling back to GPU.
[W] [TRT] Default DLA is enabled but layer proposal is not running on DLA, falling back to GPU.
[W] [TRT] DLA LAYER: Batch size(combined volume except for CHW dimensions) 300 for layer fc6_L exceeds max batch size allowed of 32.
[W] [TRT] Default DLA is enabled but layer fc6_L is not running on DLA, falling back to GPU.
...
[W] [TRT] Internal DLA error for layer upsample. Switching to GPU fallback.
[W] [TRT] Internal DLA error for layer upsample. Switching to GPU fallback.
[W] [TRT] Warning: no implementation of rpn_cls_score_reshape obeys the requested constraints, using a higher precision type
[W] [TRT] Warning: no implementation of rpn_cls_prob_reshape obeys the requested constraints, using a higher precision type
[W] [TRT] Warning: no implementation of proposal obeys the requested constraints, using a higher precision type
...
Success!

Coincidentally, these three layers, rpn_cls_score_reshape, rpn_cls_prob_reshape, and proposal, are exactly my plugin layers. Is there any relation to the “Warning: no implementation of…” messages?

My upsample layer:

layer {
  name: "upsample"
  type: "Deconvolution"
  bottom: "inc4e"
  top: "upsample"
  param { lr_mult: 0  decay_mult: 0 }
  convolution_param {
    num_output: 256
    kernel_size: 4  stride: 2  pad: 1
    group: 256
    weight_filler: { type: "bilinear" }
    bias_term: false
  }
}

It doesn’t work on DLA, presumably because of the kernel_size and pad values. But I’m wondering why it doesn’t fall back to the GPU directly instead of reporting ‘Internal DLA error’.
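
As a workaround I’m considering pinning the unsupported layers to the GPU myself before building, rather than relying on automatic fallback; a sketch using the TensorRT 5 IBuilder placement API (assuming canRunOnDLA/setDeviceType behave as documented):

    // Explicitly place every layer DLA cannot run onto the GPU before building.
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        if (!builder->canRunOnDLA(layer))
            builder->setDeviceType(layer, nvinfer1::DeviceType::kGPU);
    }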


Hi,

Could you share the differences in model architecture between the failing model and the successful one?

Based on the log shared in #1, the application may be crashing because it runs out of resources:

[E] [TRT] ../builder/tacticOptimizer.cpp(1330) - OutOfMemory Error in computeCosts:0

Would you mind monitoring the system with tegrastats at the same time and sharing the log with us?

sudo tegrastats
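
For example, you can capture the output to a file while reproducing the error:

sudo tegrastats | tee tegrastats.log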

Thanks.