Memory Issue with Half2Mode in TensorRT 3

Hi,
I am trying to optimize a customized VGG16 model (prototxt file attached below).
With the FP32 data type I am able to achieve an inference time of ~105 ms.

However, when I try to use FP16 and Half2Mode with the same model, I get the following error:
Begin building engine…
ERROR: Internal error: could not find any implementation for node fc6 + relu6, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
ERROR: cudnnBuilder2.cpp (452) - OutOfMemory Error in buildSingleLayer
End building engine…
Segmentation fault (core dumped)

I have tried a variety of workspace sizes, ranging from 16 << 20 (used for FP32, which built successfully) to 3UL << 30 (above this, e.g. at 4UL << 30, the error changes to ERROR: resources.cpp (199) - Cuda Error in gieCudaMalloc: 2).
But the error is still there.
However, parsing the network only up to the pool5 layer resolves the error, and the model is then successfully optimized with FP16 and Half2Mode.
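
For reference, the relevant part of my builder setup looks roughly like this (a minimal sketch of the TensorRT 3 C++ Caffe flow; gLogger, the file names, and the output blob name "prob" are placeholders from my setup):

    // Minimal sketch of the failing FP16 build (TensorRT 3 C++ API).
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    ICaffeParser* parser = createCaffeParser();
    const IBlobNameToTensor* blobNameToTensor =
        parser->parse("network.prototxt", "network.caffemodel", *network, DataType::kHALF);
    network->markOutput(*blobNameToTensor->find("prob"));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(16 << 20);  // also tried values up to 3UL << 30
    builder->setHalf2Mode(true);             // paired-FP16 mode

    ICudaEngine* engine = builder->buildCudaEngine(*network);  // fails at fc6 + relu6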

Any directions on this would be really appreciated.

*It seems .prototxt files are not supported as attachments, so I changed the file extension from .prototxt to .txt.
network.txt (7.35 KB)

Hi,

Thanks for your report.

We can reproduce this error in our environment and have reported it to our internal team.
We will update you with more information later.

Thanks.

Alright, waiting for the update…

Hi all,

I confirm: we have the same problem.
We had to disable FP16 on the TX2.

Do you have an idea when the fix will be released?
We need to deploy the update at a client site. :/

Thanks !
François

Hi, this may be caused by dilation in your prototxt. When I use FP16 or INT8 with SSD, this issue comes up; it can be solved by removing dilation from your network and retraining. This may solve your problem.
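
If you want to check quickly whether your parsed network actually contains dilated convolutions, you can iterate over its layers, something like this (a rough sketch against the TensorRT C++ API; network is the INetworkDefinition produced by the Caffe parser):

    // Sketch: list convolution layers that use dilation > 1.
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        ILayer* layer = network->getLayer(i);
        if (layer->getType() == LayerType::kCONVOLUTION)
        {
            DimsHW d = static_cast<IConvolutionLayer*>(layer)->getDilation();
            if (d.h() > 1 || d.w() > 1)
                printf("layer %s uses dilation %dx%d\n", layer->getName(), d.h(), d.w());
        }
    }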

Hi,
Thanks for the answer.
We don't use dilation. It's the padding that we have a problem with:
we set padding on conv1.

(Actually, I don't know if you were talking to me or to Wahaj :) )

Thx again
François

Hi, francoisBilberry

Thanks for the update on this issue.

We are still checking this problem with our internal team.
We will share more information with you if there is any progress.

Thanks.

Hi,

This issue is fixed in TensorRT 3.0.4.

If you are using an x86 Linux machine, please update your package with our latest release here.
If you are using Jetson platform, please wait for our next JetPack release.

Thanks.

Oh, thank you @AastaLLL for the update…

Hi,

Thank you,

François

Hi,

TensorRT 3.0.4 for Jetson is available in JetPack 3.2 GA.

Thanks.

Thanks again @AastaLLL

Well, I met the same issue on TensorRT 5.0.0.10:

[TensorRT] ERROR: Internal error: could not find any implementation for node 2-layer MLP, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
[TensorRT] ERROR: ../builder/tacticOptimizer.cpp (1228) - OutOfMemory Error in computeCosts: 0

I decreased the batch size to 1 and increased the workspace size to 1 << 30, but I still get the error.

I am just running the “end_to_end_tensorflow_mnist” sample. It works fine when I run it in the terminal, but when I copy the code into PyCharm and set the workspace size and batch size, it fails.

Does someone know how to solve this issue?

Thanks.

Duplicate of topic 1043200.
Please check this comment for the answer.

Thanks.

Similar error here.

@AastaLLL
But I am on an NVIDIA Tesla V100 GPU, and my environment is:

TensorRT 5.0.2.6
CentOS 7.4.1708
CUDA 9.0
cuDNN 7.3.1

If I set builder->setMaxWorkspaceSize(1 << 20), it works, but when I set builder->setMaxWorkspaceSize(1 << 32), I get this error:

ERROR: Internal error: could not find any implementation for node 2-layer MLP, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
ERROR: ../builder/tacticOptimizer.cpp (1230) - OutOfMemory Error in computeCosts: 0

My code is as follows:

    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    ICaffeParser* parser = createCaffeParser();
    // parser->setPluginFactory(&pluginFactory);

    bool mEnableFp16 = builder->platformHasFastFp16();
    bool mEnableInt8 = builder->platformHasFastInt8();
    printf(LOG_GIE "platform %s Fp16 support.\n", mEnableFp16 ? "has" : "does not have");
    printf(LOG_GIE "platform %s Int8 support.\n", mEnableInt8 ? "has" : "does not have");

    DataType modelDataType = mEnableFp16 ? DataType::kHALF : DataType::kFLOAT;
    // DataType modelDataType = useInt8 ? DataType::kINT8 : DataType::kFLOAT;

    printf(LOG_GIE "loading %s \n", deployFile.c_str());

    const IBlobNameToTensor* blobNameToTensor = parser->parse(deployFile.c_str(),
                                                              modelFile.c_str(),
                                                              *network,
                                                              modelDataType);

    assert(blobNameToTensor != nullptr);

    for (int i = 0, n = network->getNbInputs(); i < n; i++)
    {
        Dims3 dims = static_cast<Dims3&&>(network->getInput(i)->getDimensions());
        std::cout << "Input \"" << network->getInput(i)->getName() << "\": " << dims.d[0] << "x" << dims.d[1] << "x" <<
        dims.d[2] << std::endl;
    }

    for (auto& s : outputs) network->markOutput(*blobNameToTensor->find(s.c_str()));

    builder->setMaxBatchSize(maxBatchSize);
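    // note: 1 << 32 shifts a 32-bit int by its full width, which is undefined
    // behavior in C++, so the next line likely does not request the intended
    // 4 GiB (a 64-bit literal such as 1ULL << 32 would).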
    builder->setMaxWorkspaceSize(1 << 32);

    // set up the network for paired-FP16 format
    if (mEnableFp16) builder->setHalf2Mode(true);

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    assert(engine);

    network->destroy();
    parser->destroy();

    gieModelStream = engine->serialize();
    engine->destroy();
    builder->destroy();
    // pluginFactory.destroyPlugin();

    std::ofstream ofs("serialized_engine.trt", std::ios::out | std::ios::binary);
    ofs.write((char*)(gieModelStream->data()), gieModelStream->size());
    ofs.close();
    gieModelStream->destroy();
    shutdownProtobufLibrary();

Can you give some advice?

Hi,

'1 << 32' is too large for your GPU.
Please lower the amount to avoid running out of memory.
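
For example, something like this (just a sketch; the right value depends on your GPU memory, and using a 64-bit literal such as 1ULL avoids overflowing a 32-bit int):

    // Request a more modest workspace, e.g. 1 GiB.
    builder->setMaxWorkspaceSize(1ULL << 30);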

Thanks.