TensorRT 2.1 OutOfMemory Error in buildSingleLayer

Hey Nvidia,

I’m looking for some help with TensorRT.

I receive the following error when passing a deployed (.prototxt) file to the giexec tool (I also encounter the same error in my own implementation of the code where TensorRT is integrated with the API).

Internal error: could not find any implementation for node `global_pool`, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
cudnnBuilder2.cpp (586) - OutOfMemory Error in buildSingleLayer

In net.prototxt, global_pool is implemented as follows:

layer {
  name: "global_pool"
  type: "Pooling"
  bottom: "concat"
  top: "global_pool"
  pooling_param {
     pool: AVE
     kernel_size: 32
     stride: 32
     pad: 0
  }
}

What I’ve done to address this error:

  • Increasing the workspace size (code block below). I've tried values of n from 1 to 20 and of N from 1 to 20, keeping in mind that the GPU only has so much memory available for temporary operations:
    builder->setMaxWorkspaceSize(n << N);
  • Using identical project files on a laptop/desktop with TensorRT 2.1 installed, inference works perfectly.
  • I’m not sure where to go from here. Looking for some advice.

    Matthew J

    Not sure it will help, but if you have an SD card or some spare disk, you could add swap and give it a try.


    Could you check how much memory TensorRT uses on the desktop?

    Hey AastaLLL,

    I should add that the code is the same on both the Tegra and the Desktop. That being said, the GPU memory usage on the Desktop does not exceed 340 MiB according to nvidia-smi.

    I should add that I’m trying to run the network in 16-bit floating point. Everything works as expected in 32-bit floating point. Both 16-bit and 32-bit modes work fine on the Desktop/Laptop; only 32-bit works on the Tegra. I forgot to mention this important point in my original post.



    What is your batch size? Could you lower the batch size, and try it again?

    Hi mjones,
    Did you solve this issue? I’m hitting the same problem.


    Here are two suggestions for this issue:

    1. Decrease batch size

    2. Increase workspace size
    Please check this page for more information:


    I hit the same issue when building an FP16 model for a Tesla P100. Are you sure this is related to workspace size? Why would building an FP16 model need more memory than building its FP32 counterpart?

    I kept increasing the workspace up to 16 GB and then hit gieCudaMalloc failures:

    Total Activation Memory: 17213442048
    resources.cpp (57) - Cuda Error in gieCudaMalloc: 2


    Could you share your model file?
    We want to reproduce this issue on our side and give a further suggestion.


    It was a bug in my code. I was building the FP16 model by calling the C++ IBuilder::setHalf2Mode(true), but I was still setting the weights as DataType::kFLOAT somewhere in my code. After I converted the weights to FP16 and set the type to DataType::kHALF, I could successfully build the model.

    We should fix the error message.

    Thanks for your feedback.