tl;dr - my enqueue call only produces correct output when its batchSize argument equals the maximum batch size specified at engine build time.
I am currently building a model whose intermediate steps each act on the tensors produced by the preceding step. The first step of the model outputs a tensor whose batch size varies from run to run. For the second step, I have a TensorRT engine invoked by the function below, with nbBuffer=2; the output for a single input is of size 2*sizeof(float).
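For context, the engine comes from the implicit-batch build path. A minimal sketch of that setup (the numeric values here are illustrative, not my exact ones, and gLogger stands in for my ILogger instance):

IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
// ... network definition / parser calls omitted ...
builder->setMaxBatchSize(10);           // illustrative max batch size
builder->setMaxWorkspaceSize(1 << 20);  // illustrative workspace size
ICudaEngine* engine = builder->buildCudaEngine(*network);
network->destroy();
builder->destroy();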
I construct batches of size batchSize in the for loop and pass them through the engine; inference on these full batches works fine. In some cases there are left-over elements that cannot be packed into a full-size batch, and these are passed through the engine with a batch size of missed_inputs. For some reason this final inference call returns incorrect data (I can verify the expected output by running with batchSize=1). The TensorRT documentation states in multiple places that enqueueing with a batch size smaller than the maximum should still work. Does anyone understand what my issue is?
#include <cassert>
#include <iostream>
#include <cuda_runtime.h>
#include "NvInfer.h"

using namespace nvinfer1;
using namespace std;

// engine (an ICudaEngine*) and gProfiler (an IProfiler implementation) are
// globals set up elsewhere.

void Inference(void** buffers, int nbBuffer, int batchSize, int in_size, int tot_num_inputs)
{
    // buffers:        array of device pointers to all inputs and outputs of the net
    // nbBuffer:       number of elements in buffers
    // batchSize:      the maximum batch size specified at engine build time
    // in_size:        number of floats in an individual (non-batched) input tensor
    // tot_num_inputs: total number of inputs to run through the network
    //                 (batch size * number of batches + remainder)
    assert(engine->getNbBindings() == nbBuffer);
    IExecutionContext* context = engine->createExecutionContext();
    context->setProfiler(&gProfiler);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int num_of_batches = tot_num_inputs / batchSize;
    int missed_inputs  = tot_num_inputs - num_of_batches * batchSize;
    assert(missed_inputs >= 0);
    for (int i = 0; i < num_of_batches; i++)
    {
        // Keep the enqueue() call outside assert(): with NDEBUG defined,
        // the whole call would be compiled out.
        bool ok = context->enqueue(batchSize, buffers, stream, nullptr);
        assert(ok);
        // Advance both bindings to the next batch (2 output floats per input).
        buffers[0] = static_cast<void*>(static_cast<float*>(buffers[0]) + batchSize * in_size);
        buffers[1] = static_cast<void*>(static_cast<float*>(buffers[1]) + batchSize * 2);
    }
    if (missed_inputs > 0)
    {
        cout << "missed inputs" << endl;
        // Partial batch, smaller than the engine's max batch size. This is
        // the enqueue that returns incorrect data.
        bool ok = context->enqueue(missed_inputs, buffers, stream, nullptr);
        assert(ok);
    }
    cudaStreamSynchronize(stream);

    // Rewind the bindings to their original positions: each loop iteration
    // advanced them by one full batch, so rewind by num_of_batches full batches.
    buffers[0] = static_cast<void*>(static_cast<float*>(buffers[0]) - num_of_batches * batchSize * in_size);
    buffers[1] = static_cast<void*>(static_cast<float*>(buffers[1]) - num_of_batches * batchSize * 2);
    cudaStreamDestroy(stream);
    context->destroy();
}
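In case the calling pattern matters, here is roughly how I allocate the buffers and invoke the function. The sizes and host vectors below are illustrative placeholders, not my exact values; the device buffers hold all inputs and outputs back to back, which is why Inference() walks the binding pointers forward batch by batch:

const int maxBatch = 10;  // illustrative: max batch size the engine was built with
const int inSize   = 8;   // illustrative: floats per single input
const int total    = 25;  // illustrative: two full batches plus 5 left over

std::vector<float> hostIn(total * inSize);   // filled elsewhere
std::vector<float> hostOut(total * 2);

void* buffers[2];
cudaMalloc(&buffers[0], total * inSize * sizeof(float)); // all inputs
cudaMalloc(&buffers[1], total * 2 * sizeof(float));      // all outputs, 2 floats each

cudaMemcpy(buffers[0], hostIn.data(), total * inSize * sizeof(float), cudaMemcpyHostToDevice);
Inference(buffers, 2, maxBatch, inSize, total);
cudaMemcpy(hostOut.data(), buffers[1], total * 2 * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(buffers[0]);
cudaFree(buffers[1]);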