Unexpected behavior of TopKLayer

I am confused about how to use the TopK layer:

  1. The documentation says: "The TopK layer has two outputs of the same dimensions. The first contains data values, the second contains index positions for the values." (https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/topics/classnvinfer1_1_1_i_network_definition.html#a384a409318bf416be3aa4442f2b0ce76). What I observe, though, is that getNbOutputs() returns 1, and calling getOutput(index) with index = 1 results in a segmentation fault.
  2. For an input of size [BatchSize, NumChannels, Height, Width], I would expect the output (whether data values or index positions) to have size [BatchSize, NumChannels, Height, K], where K is the number of elements to keep and Width is the reduction axis. Instead, the output size I get is the same as the input size.
  3. When I run the code attached below, I get the error:
    ERROR: topk: input and output must have the same number of dimensions
    topkdebug: topKdbg.cpp:139: void APIToModel(unsigned int, nvinfer1::IHostMemory**): Assertion `engine != nullptr' failed.

    I tried both defining the output size as I expected it and defining it the same as the input; the problem is the same either way.

Is this the expected behaviour, and am I misunderstanding something?
Can anyone show me an example of the correct usage of the layer? I know it is used in the sampleCharRNN sample, but that code doesn’t run for me either.

I use TensorRT with CUDA 8.0.

Please find my toy example code to reproduce the issue below.



#include "NvInfer.h"
#include "NvCaffeParser.h"
#include "NvUtils.h"
#include "cuda_runtime_api.h"
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <string>
#include <fstream>
#include <iostream>
#include <sstream>
#include <sys/stat.h>
#include <vector>
#include <algorithm>

// Abort on any non-zero CUDA status; the status is evaluated only once.
#define CHECK(status)                                   \
{                                                       \
    auto ret = (status);                                \
    if (ret != 0)                                       \
    {                                                   \
        std::cout << "Cuda failure: " << ret;           \
        abort();                                        \
    }                                                   \
}

// Logger for GIE info/warning/errors
class Logger : public nvinfer1::ILogger
{
    void log(nvinfer1::ILogger::Severity severity, const char* msg) override
    {
        // suppress info-level messages
        if (severity == Severity::kINFO) return;

        switch (severity)
        {
            case Severity::kINTERNAL_ERROR: std::cerr << "INTERNAL_ERROR: "; break;
            case Severity::kERROR: std::cerr << "ERROR: "; break;
            case Severity::kWARNING: std::cerr << "WARNING: "; break;
            case Severity::kINFO: std::cerr << "INFO: "; break;
            default: std::cerr << "UNKNOWN: "; break;
        }
        std::cerr << msg << std::endl;
    }
};

// def constants
static const int K_TOP = 10; //3;
static const int INPUT_H = 1;
static const int INPUT_W = 10;
static const int INPUT_C = 5;
static const int OUTPUT_H = 1;
//static const int OUTPUT_W = 10;
static const int OUTPUT_W = K_TOP;
static const int OUTPUT_C = 5;

static const int BATCH_SIZE = 4;
static const int MAX_BATCH_SIZE = 4;

const char* INPUT_BLOB_NAME = "data";
const char* OUTPUT_BLOB_NAME = "out";

static void* buffers[2];
static cudaStream_t stream;
static int inputIndex, outputIndex;

using namespace nvinfer1;
static Logger gLogger;

// print tensor dimensions
void printDims(ITensor* data)
{
    Dims dims = data->getDimensions();
    int nbDims = dims.nbDims;
    for (int d = 0; d < nbDims; d++)
        std::cout << dims.d[d] << " ";
    std::string sss;
    if (data->getType() == DataType::kHALF)
        sss = "float16";
    if (data->getType() == DataType::kFLOAT)
        sss = "float32";
    std::cout << sss << " ";
    std::cout << std::endl;
}

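void APIToModel(unsigned int maxBatchSize, IHostMemory **modelStream)
{
    // create the builder
    IBuilder* builder = createInferBuilder(gLogger);

    INetworkDefinition* network = builder->createNetwork();

    // define input
    auto data = network->addInput(INPUT_BLOB_NAME, DataType::kFLOAT, DimsCHW{INPUT_C, INPUT_H, INPUT_W});
    assert(data != nullptr);
    std::cout << "input" << std::endl;
    printDims(data);

    // apply topK; bit 2 of reduceAxis selects the third non-batch dimension (W)
    int reduceAxis = 0x4;
    auto topk = network->addTopK(*data, TopKOperation::kMAX, K_TOP, reduceAxis);
    assert(topk != nullptr);
    std::cout << "topk0" << std::endl;
    printDims(topk->getOutput(0));
//    std::cout << "topk1" << std::endl;
//    printDims(topk->getOutput(1));
//    std::cout << topk->getNbOutputs() << std::endl;

    // Name output 0 (the values) and mark it as the network output.
    // Assumed step: setUpDevice() looks up the OUTPUT_BLOB_NAME binding.
    topk->getOutput(0)->setName(OUTPUT_BLOB_NAME);
    network->markOutput(*topk->getOutput(0));

    // Build the engine (the workspace size is an assumed value)
    builder->setMaxBatchSize(maxBatchSize);
    builder->setMaxWorkspaceSize(1 << 20);

    std::cout << "building the engine..." << std::endl;
    auto engine = builder->buildCudaEngine(*network);
    assert(engine != nullptr);
    std::cout << "engine built!" << std::endl;

    // serialize the engine, then close everything down
    (*modelStream) = engine->serialize();
    engine->destroy();
    network->destroy();
    builder->destroy();
}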



void setUpDevice(IExecutionContext& context, int batchSize)
{
    const ICudaEngine& engine = context.getEngine();
    // input and output buffer pointers that we pass to the engine - the engine requires exactly IEngine::getNbBindings()
    // of these, but in this case we know that there is exactly one input and one output.
    assert(engine.getNbBindings() == 2);

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // note that indices are guaranteed to be less than IEngine::getNbBindings()
    inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);

    // create GPU buffers and a stream
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_H * INPUT_W * INPUT_C * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_W * OUTPUT_H * OUTPUT_C * sizeof(float)));

    // create cuda stream
    CHECK(cudaStreamCreate(&stream));
}

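void cleanUp()
{
    // release the stream and the buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}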

void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
    // DMA the input to the GPU, execute the batch asynchronously, and DMA it back:
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * INPUT_C * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_W * OUTPUT_H * OUTPUT_C * sizeof(float), cudaMemcpyDeviceToHost, stream));
    // wait for the copies and the inference to finish before the host reads the output
    cudaStreamSynchronize(stream);
}

void printData(float *out, const int batch_size, const int output_c,  const int output_h,  const int output_w)
{
    int output_size(output_c * output_h * output_w);

    std::cout << "================="<< std::endl;
    std::cout << "================="<< std::endl;
    for (int b = 0; b < batch_size; b++)
    {
        std::cout << "-----------------"<< std::endl;
        for (int c = 0; c < output_c; c++)
        {
            for (int h = 0; h < output_h; h++)
            {
                for (int w = 0; w < output_w; w++)
                    std::cout << out[b * output_size + c * output_h * output_w + h * output_w + w] << " ";
                std::cout << std::endl;
            }
            std::cout << "-----------------"<< std::endl;
        }
        std::cout << "================="<< std::endl;
        std::cout << "================="<< std::endl;
    }
}


int main(int argc, char** argv)
{
    // allocate CPU memory for input and output
    int inputSize = sizeof(float) * BATCH_SIZE * INPUT_C * INPUT_H * INPUT_W;
    int outputSize = sizeof(float) * BATCH_SIZE * OUTPUT_C * OUTPUT_W * OUTPUT_H;
    float *data = (float *)malloc(inputSize);
    float *out = (float *)malloc(outputSize);

    // init dummy input
    srand (time(NULL));
    for (int d = 0; d < BATCH_SIZE * INPUT_W * INPUT_H * INPUT_C; d++)
        data[d] = rand() % 100 + 1;

    // print input
    printData(data, BATCH_SIZE, INPUT_C, INPUT_H, INPUT_W);

    // create a model using the API directly and serialize it to a stream
    IHostMemory *modelStream{nullptr};
    APIToModel(MAX_BATCH_SIZE, &modelStream);
    assert(modelStream != nullptr);
    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine = runtime->deserializeCudaEngine(modelStream->data(), modelStream->size(), nullptr);
    modelStream->destroy();
    IExecutionContext *context = engine->createExecutionContext();

    // allocate device memory, do bindings
    setUpDevice(*context, BATCH_SIZE);

    // run inference
    doInference(*context, data, out, BATCH_SIZE);

    // destroy the engine
    context->destroy();
    engine->destroy();
    runtime->destroy();

    // free device memory
    cleanUp();

    // print output
    printData(out, BATCH_SIZE, OUTPUT_C, OUTPUT_H, OUTPUT_W);

    //free host mem
    free(data);
    free(out);

    std::cout << "done!" << std::endl;

    return 0;
}


Thanks for your message. My replies follow your question numeration:

  1. The documentation is correct. I was able to run your toy application successfully, and the result matches the expectations set in the link you’ve provided. It worked for me with both the CUDA 9.0-based and the CUDA 8.0-based versions of TensorRT.

  2. The topK outputs have the same number of dimensions as the input. The shape differs in exactly one dimension: the one selected by the bit set in the “reduceAxes” parameter has the size given by the “k” parameter.

In your code reduceAxes is 4, which is 1<<2. Hence, the 3rd (or 2nd zero-based) dimension of your input tensor, counting from the left-hand side and excluding batch, will be reduced to k. In your code, input {INPUT_C, INPUT_H, INPUT_W} will result in {INPUT_C, INPUT_H, K_TOP}. Setting reduceAxes to 1 (i.e. 1<<0), for instance, would have resulted in {K_TOP, INPUT_H, INPUT_W}. Currently, only one dimension can be reduced, i.e. it’s not possible to find the k max values per channel across the whole plane.

Also, note that k may not be larger than the size of the reduced dimension. In your case, any other value for reduceAxes would have resulted in an error, since all other dimensions are smaller than k.
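To make the shape rule above concrete, here is a minimal sketch in plain C++ that does not call TensorRT at all. topKOutputShape is a hypothetical helper written for illustration, not part of the TensorRT API; it only mirrors the rule as described in this reply (exactly one bit set in reduceAxes, batch dimension excluded, k no larger than the reduced dimension).

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (NOT a TensorRT function): returns the shape a TopK
// output would have under the rule described above.
// `dims` excludes the batch dimension, e.g. {C, H, W}.
std::vector<int> topKOutputShape(const std::vector<int>& dims, int k, std::uint32_t reduceAxes)
{
    int axis = -1;
    for (int bit = 0; bit < 32; ++bit)
    {
        if ((reduceAxes & (1u << bit)) == 0)
            continue;
        if (bit >= static_cast<int>(dims.size()))
            throw std::invalid_argument("reduceAxes selects a dimension the input does not have");
        if (axis != -1)
            throw std::invalid_argument("only one dimension can be reduced");
        axis = bit;
    }
    if (axis == -1)
        throw std::invalid_argument("reduceAxes selects no dimension");
    if (k < 1 || k > dims[axis])
        throw std::invalid_argument("k must be between 1 and the size of the reduced dimension");

    // Same number of dimensions as the input; only the reduced one changes.
    std::vector<int> out = dims;
    out[axis] = k;
    return out;
}
```

With the constants from the toy example, topKOutputShape({5, 1, 10}, 10, 0x4) gives {5, 1, 10}, which is why the output looks the same size as the input when K_TOP equals INPUT_W. With k = 3 it gives {5, 1, 3}, and with reduceAxes = 0x1 and k = 10 the helper throws, because the channel dimension only has 5 elements.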

  3. As mentioned above, I was unable to reproduce the behaviour you’re experiencing, which is indeed unexpected. The error may indicate a runtime problem such as resource contention: perhaps a background process is using the GPU and preventing you from making an allocation. It may sound silly, but did you try rebooting the computer? (It has worked for me more often than I’d like to admit.)

Also, is there a chance your installation was incomplete or upset by some recent changes? You may wish to try it on another computer, or install the CUDA 9.0-based package if you can afford it. Note that once you switch to the CUDA 9.0-based TensorRT, it won’t be easy to switch back.


  • Kostya

Hi Kostya,
thanks for replying;
I keep having the same issue on another machine, with TensorRT based on CUDA 9.0;
I’ll let you know if I find something wrong with my installations;

meanwhile, could you please tell me what topk->getNbOutputs() prints if you uncomment line 121 of my example code?

thanks again,


Hi again,
I checked the TensorRT installation on my machines, and there was indeed something wrong that caused the problem above: on both machines I tested, an old TensorRT 3.0 installation was shadowing the updated libraries. When I ran the test on a third machine with a fresh TensorRT installation, everything worked fine;
thanks again for your help;

