enqueue/execute batch size argument must be the same as the maximum specified at build time

tl;dr - my enqueue call only produces correct output when its batchSize argument equals the maximum batch size specified at engine build time.

I am currently building a model. The model has intermediate steps that each act on tensors from the preceding step. The first step outputs a tensor whose batch size varies. For the second step, I have an engine invoked by the function below; nbBuffer=2, and the output for a single input has size 2*sizeof(float).
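
For reference, the engine is built with the implicit-batch TensorRT API, roughly like this (a trimmed-down sketch; gLogger, network, and maxBatchSize stand in for what my real build code uses):

nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
// ... network definition / parsing ...
builder->setMaxBatchSize(maxBatchSize);  // largest batch any enqueue call may use
nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);
builder->destroy();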

I construct batches of size batchSize in the for loop and pass them through the engine; this inference works fine. In some cases there are left-over elements that could not be packed into a full-size batch, so these are passed through the engine with a batch size of missed_inputs. For some reason, this final inference call returns incorrect data (I can verify what the correct output should look like by running everything with batchSize=1). The TensorRT documentation states clearly in multiple places that calling enqueue with a batch size smaller than the build-time maximum should still work. Does anyone understand what my issue is?

void Inference(void** buffers, int nbBuffer, int batchSize, int in_size, int tot_num_inputs)
{  
    // buffers: Array of pointers to all the inputs and outputs of net.
    // nbBuffer: Number of elements in buffers
    // batchSize: the maximum batch size specified at engine build time
    // in_size: size of individual (non-batched) tensor to the network
    // tot_num_inputs: total number of elements that should be run through the network (batch size * number of batches + remainder)

    assert(engine->getNbBindings()==nbBuffer);  
    IExecutionContext* context = engine->createExecutionContext(); 
    context->setProfiler(&gProfiler);  
 
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int num_of_batches = tot_num_inputs/batchSize;
    int missed_inputs = tot_num_inputs - num_of_batches*batchSize;
    assert(missed_inputs >=0);
    for (int i = 0; i < num_of_batches; i++)
    {       
        // enqueue must not be wrapped in assert(): under NDEBUG the call
        // itself would be compiled out
        bool ok = context->enqueue(batchSize, buffers, stream, nullptr);
        assert(ok);
        // advance the binding pointers to the next batch's slice of input/output
        buffers[0] = static_cast<void*>(static_cast<float*>(buffers[0]) + batchSize*in_size);
        buffers[1] = static_cast<void*>(static_cast<float*>(buffers[1]) + batchSize*2);
    }
    if(missed_inputs > 0)
    {
        cout << "missed inputs" << endl;
        assert(context->enqueue(missed_inputs, buffers, stream, nullptr)); 
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    // rewind the binding pointers to where they started; the loop above
    // advanced them by batchSize elements per full batch
    buffers[0] = static_cast<void*>(static_cast<float*>(buffers[0]) - num_of_batches*batchSize*in_size);
    buffers[1] = static_cast<void*>(static_cast<float*>(buffers[1]) - num_of_batches*batchSize*2);
 
    context->destroy();
}
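
For completeness, here is roughly how I call the function (illustrative numbers; buffers[0] holds all inputs contiguously and buffers[1] has room for every output, since the loop walks the pointers through them):

const int maxBatchSize = 8;    // same value passed to setMaxBatchSize at build time
const int inSize = 16;         // floats in a single (non-batched) input tensor
const int totInputs = 19;      // two full batches of 8 plus a remainder of 3

void* buffers[2];
cudaMalloc(&buffers[0], totInputs * inSize * sizeof(float));  // all inputs, back to back
cudaMalloc(&buffers[1], totInputs * 2 * sizeof(float));       // 2 floats of output per input
// ... cudaMemcpy the input data into buffers[0] ...
Inference(buffers, 2, maxBatchSize, inSize, totInputs);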