Description
I can’t seem to find a clear example of how to perform batch inference using explicit batch mode.
I see many outdated articles pointing to this example here, but that code only uses a batch size of 1. Other examples I have found use implicit batch mode, but that mode is now deprecated, so I need an example demonstrating explicit batch mode.
How can I use a batch size larger than 1?
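My current understanding (please correct me if I’m wrong) is that with explicit batch the batch size is simply the leading dimension of the input tensor, so the ONNX model would need to be exported with a dynamic batch dimension and the engine built with an optimization profile covering the batch sizes I want. The sketch below is what I have in mind; it is untested, model.onnx and engine.trt are placeholder names, and input.1 plus the 3x112x112 shape come from my model:
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdint>
#include <fstream>
#include <memory>

// Rough sketch (untested): build an explicit-batch engine whose batch dimension is
// dynamic. Assumes the ONNX model was exported with input.1 of shape [-1, 3, 112, 112].
void buildDynamicBatchEngine(nvinfer1::ILogger& logger) {
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
    parser->parseFromFile("model.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

    // The optimization profile declares which batch sizes the engine must accept.
    auto profile = builder->createOptimizationProfile();
    profile->setDimensions("input.1", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, 3, 112, 112));
    profile->setDimensions("input.1", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(4, 3, 112, 112));
    profile->setDimensions("input.1", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(8, 3, 112, 112));
    config->addOptimizationProfile(profile);

    // Serialize to disk so InferenceEngine::infer() can deserialize it as before.
    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(builder->buildSerializedNetwork(*network, *config));
    std::ofstream out("engine.trt", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());
}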
I followed the sampleOnnxMNIST.cpp sample code to create the following. You can assume that m_inputDims and m_outputDims are of type nvinfer1::Dims and already contain the relevant information.
bool InferenceEngine::infer() {
    // Read the serialized model file
    std::ifstream file(m_enginePath, std::ios::binary | std::ios::ate);
    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);

    std::vector<char> buffer(size);
    if (!file.read(buffer.data(), size)) {
        throw std::runtime_error("Unable to read engine file");
    }

    std::unique_ptr<IRuntime> runtime{createInferRuntime(m_logger)};
    if (!runtime) {
        return false;
    }

    m_engine = std::shared_ptr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(buffer.data(), buffer.size()));
    if (!m_engine) {
        return false;
    }

    // Create RAII buffer manager object
    samplesCommon::BufferManager buffers(m_engine); // TODO can specify the batch size in this call.

    auto context = std::unique_ptr<nvinfer1::IExecutionContext>(m_engine->createExecutionContext());
    if (!context) {
        return false;
    }

    size_t batchSize = 1;
    if (!processInput(buffers, batchSize)) {
        return false;
    }

    // Memcpy from host input buffers to device input buffers
    buffers.copyInputToDevice();

    bool status = context->executeV2(buffers.getDeviceBindings().data());
    if (!status) {
        return false;
    }

    // Memcpy from device output buffers to host output buffers
    buffers.copyOutputToHost();

    const int outputSize = m_outputDims.d[1];
    float* output = static_cast<float*>(buffers.getHostBuffer("2621"));
    for (int i = 0; i < outputSize; ++i) {
        std::cout << output[i] << " ";
    }
    std::cout << "\n\n\n" << std::endl;

    return true;
}
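For reference, here is a rough, untested sketch of how I imagine the batched version of this would look: set the binding dimensions on the context first, then size the device buffers from the context (which now reports the concrete batch size) rather than from the engine. runBatched is just a hypothetical helper name, and I use plain cudaMalloc instead of the sample BufferManager:
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

// Hypothetical helper (untested) showing the parts of infer() I think would change for
// batch > 1. Binding names "input.1" and "2621" are taken from my model.
static bool runBatched(nvinfer1::ICudaEngine& engine, nvinfer1::IExecutionContext& context,
                       const std::vector<float>& hostInput, std::vector<float>& hostOutput,
                       int batchSize) {
    const int inIdx = engine.getBindingIndex("input.1");
    const int outIdx = engine.getBindingIndex("2621");

    // Tell the context the actual batch size for this call.
    context.setBindingDimensions(inIdx, nvinfer1::Dims4(batchSize, 3, 112, 112));
    if (!context.allInputDimensionsSpecified()) {
        return false;
    }

    // Element counts taken from the context, which now reports the concrete batch,
    // not the engine's -1 placeholder.
    auto volume = [](const nvinfer1::Dims& d) {
        size_t v = 1;
        for (int i = 0; i < d.nbDims; ++i) v *= static_cast<size_t>(d.d[i]);
        return v;
    };
    const size_t inputCount = volume(context.getBindingDimensions(inIdx));
    const size_t outputCount = volume(context.getBindingDimensions(outIdx));
    hostOutput.resize(outputCount);

    // Device buffers sized for the whole batch.
    std::vector<void*> bindings(engine.getNbBindings(), nullptr);
    cudaMalloc(&bindings[inIdx], inputCount * sizeof(float));
    cudaMalloc(&bindings[outIdx], outputCount * sizeof(float));

    cudaMemcpy(bindings[inIdx], hostInput.data(), inputCount * sizeof(float), cudaMemcpyHostToDevice);
    const bool ok = context.executeV2(bindings.data());
    cudaMemcpy(hostOutput.data(), bindings[outIdx], outputCount * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(bindings[inIdx]);
    cudaFree(bindings[outIdx]);
    return ok;
}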
And here is the definition for the processInput function:
bool InferenceEngine::processInput(const samplesCommon::BufferManager &buffers, size_t batchSize) {
    auto image = cv::imread("../img.jpg");
    if (image.empty()) {
        throw std::runtime_error("Could not load image");
    }
    cv::cvtColor(image, image, cv::COLOR_BGR2RGB);

    const int inputH = m_inputDims.d[2];
    const int inputW = m_inputDims.d[3];

    // Preprocess: scale to [0, 1], then normalize to [-1, 1]
    image.convertTo(image, CV_32FC3, 1.f / 255.f);
    cv::subtract(image, cv::Scalar(0.5f, 0.5f, 0.5f), image, cv::noArray(), -1);
    cv::divide(image, cv::Scalar(0.5f, 0.5f, 0.5f), image, 1, -1);

    float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer("input.1"));

    // Interleaved HWC -> planar CHW (hard-coded for my 112x112x3 input)
    int r = 0, g = 0, b = 0;
    for (int i = 0; i < 112 * 112 * 3; ++i) {
        if (i % 3 == 0) {
            hostDataBuffer[r++] = *(reinterpret_cast<float*>(image.data) + i);
        } else if (i % 3 == 1) {
            hostDataBuffer[g++ + 112 * 112] = *(reinterpret_cast<float*>(image.data) + i);
        } else {
            hostDataBuffer[b++ + 112 * 112 * 2] = *(reinterpret_cast<float*>(image.data) + i);
        }
    }

    for (int i = 0; i < 30; ++i) {
        std::cout << hostDataBuffer[i] << " ";
    }
    std::cout << "\n\n";

    return true;
}
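And here is how I imagine a batched processInput might look; processInputBatched and its imagePaths argument are hypothetical, each image is written at an offset of b * 3 * inputH * inputW into the host buffer, and I added a cv::resize so any input size works. Is this the right idea?
#include <opencv2/opencv.hpp>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical batched variant of processInput (not part of my current class).
// hostDataBuffer is assumed to already have room for imagePaths.size() images.
bool InferenceEngine::processInputBatched(float* hostDataBuffer,
                                          const std::vector<std::string>& imagePaths) {
    const int inputH = m_inputDims.d[2];
    const int inputW = m_inputDims.d[3];
    const size_t imageVolume = 3 * static_cast<size_t>(inputH) * inputW;

    for (size_t b = 0; b < imagePaths.size(); ++b) {
        cv::Mat image = cv::imread(imagePaths[b]);
        if (image.empty()) {
            throw std::runtime_error("Could not load image: " + imagePaths[b]);
        }
        cv::cvtColor(image, image, cv::COLOR_BGR2RGB);
        cv::resize(image, image, cv::Size(inputW, inputH)); // added so any input size works

        // Same normalization as above: [0, 255] -> [-1, 1]
        image.convertTo(image, CV_32FC3, 1.f / 255.f);
        cv::subtract(image, cv::Scalar(0.5f, 0.5f, 0.5f), image, cv::noArray(), -1);
        cv::divide(image, cv::Scalar(0.5f, 0.5f, 0.5f), image, 1, -1);

        // Interleaved HWC -> planar CHW, into this image's slab of the batch buffer.
        float* dst = hostDataBuffer + b * imageVolume;
        const float* src = reinterpret_cast<const float*>(image.data);
        int r = 0, g = 0, bl = 0;
        for (int i = 0; i < inputH * inputW * 3; ++i) {
            if (i % 3 == 0)      dst[r++] = src[i];
            else if (i % 3 == 1) dst[g++ + inputH * inputW] = src[i];
            else                 dst[bl++ + 2 * inputH * inputW] = src[i];
        }
    }
    return true;
}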
For a batch size of 1, this works great. However, how would I adapt the above code to work for a batch size greater than 1? The call float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer("input.1")); only gives me memory for a single image (batch size 1), so how do I ensure enough memory has been allocated for the batch size I plan on running?
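In other words, I assume I would have to allocate the host staging buffer myself for the whole batch, something like this fragment (batchSize as in infer() above, 3x112x112 from my model):
// Assumption: allocate the host staging area for the whole batch instead of relying
// on the size BufferManager derives from the engine.
const size_t inputVolume = static_cast<size_t>(batchSize) * 3 * 112 * 112;
std::vector<float> hostInput(inputVolume);      // plain pageable host memory
// or pinned memory for faster host-to-device copies:
float* pinnedInput = nullptr;
cudaMallocHost(reinterpret_cast<void**>(&pinnedInput), inputVolume * sizeof(float));
// ... fill, copy to device, run inference, then:
cudaFreeHost(pinnedInput);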
Additionally, something I am confused about: how does the call bool status = context->executeV2(buffers.getDeviceBindings().data()); know the batch size, given that we pass no argument stating how large the buffers being handed to it are?
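My guess is that the batch size comes from the binding dimensions held by the execution context rather than from the bindings pointer, so for a dynamic-batch engine I would have to set them before the call, roughly like this. Is that correct?
// My understanding: executeV2 reads the batch size from the binding dimensions the
// execution context currently holds, not from the bindings pointer itself. For a
// dynamic-batch engine those have to be set before the call:
context->setBindingDimensions(0, nvinfer1::Dims4(batchSize, 3, 112, 112)); // binding 0 = "input.1"
if (!context->allInputDimensionsSpecified()) {
    return false;
}
bool status = context->executeV2(buffers.getDeviceBindings().data());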
Environment
TensorRT Version: 8.0.3.4
GPU Type: RTX 3080
Nvidia Driver Version: 465.19.01
CUDA Version: 11.3
CUDNN Version:
Operating System + Version: Ubuntu 20.04