Hi,
I recently stumbled upon NVIDIA’s repo implementing accelerated pose estimation with TensorRT (GitHub - NVIDIA-AI-IOT/trt_pose: Real-time pose estimation accelerated with NVIDIA TensorRT). I made a stripped-down C++ version of this implementation by extracting and serializing the TensorRT engine from the torch2trt output and running inference on it directly from C++.
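For context, this is roughly how I load the serialized engine on the C++ side. It is only a minimal sketch: the file name "pose_engine.trt" and the bare-bones logger are placeholders, not my actual project code.

// Minimal sketch of deserializing the engine that was serialized from the torch2trt output.
// "pose_engine.trt" and the trivial logger below are placeholders, not my real setup.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <vector>

class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

int main()
{
    Logger logger;

    // read the serialized engine into memory
    std::ifstream file("pose_engine.trt", std::ios::binary | std::ios::ate);
    const size_t size = file.tellg();
    file.seekg(0);
    std::vector<char> blob(size);
    file.read(blob.data(), size);

    // deserialize the engine and create an execution context
    // (the 3-argument overload matches the TensorRT version on my JetPack;
    // newer releases drop the plugin-factory argument)
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(blob.data(), size, nullptr);
    nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext();

    // ... cudaMalloc the device_buffers from the engine bindings, cudaStreamCreate the
    // cuda_stream, then run the timed block shown below ...
    return 0;
}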
The inference works well, but I notice that the inference FPS is slightly lower than stated in the repo. I am running the inference on a Jetson Nano with the resnet18_baseline_att_224x224 model, which, according to the repo, should run at 22 FPS (excluding pre- and post-processing, I assume). When I measure the time it takes to copy the input to the device buffers, run inference, and copy the output back to the host buffers, I get about 18 FPS instead of 22. Below is the code that I timed (how I measure it is sketched after the snippet):
// The network input is CHW and RGB, while cv::Mat objects are HWC and BGR,
// so copy the split channels to the device buffer one at a time, in RGB order.
NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0], output_channels[2].data,
config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
cuda_stream));
NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0] + config.input_size.area(), output_channels[1].data,
config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
cuda_stream));
NV_CUDA_CHECK(cudaMemcpyAsync((float*)device_buffers[0] + 2 * config.input_size.area(), output_channels[0].data,
config.input_size.area() * sizeof(float), cudaMemcpyHostToDevice,
cuda_stream));
// do the inference
execution_context->enqueue(1, device_buffers, cuda_stream, nullptr);
// copy output from device buffer to host buffer
NV_CUDA_CHECK(cudaMemcpyAsync(output0_host_buffer, device_buffers[1],
config.num_part_types * config.output_map_size.area() * sizeof(float),
cudaMemcpyDeviceToHost, cuda_stream));
NV_CUDA_CHECK(cudaMemcpyAsync(output1_host_buffer, device_buffers[2],
2 * config.num_link_types * config.output_map_size.area() * sizeof(float),
cudaMemcpyDeviceToHost, cuda_stream));
// block until all GPU-related operations have ended for this inference
cudaStreamSynchronize(cuda_stream);
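For completeness, this is roughly how I arrive at the ~18 FPS figure: wall-clock timing around the block above, averaged over a number of iterations after a warm-up run. run_inference() here is just a stand-in for that block, not a real function in my code.

#include <chrono>
#include <iostream>

// placeholder for the copy -> enqueue -> copy-back -> synchronize block shown above
void run_inference()
{
}

int main()
{
    const int iterations = 200;

    run_inference(); // warm-up run so one-time initialization is not counted

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        run_inference(); // the block ends with cudaStreamSynchronize, so wall-clock timing is valid
    const auto end = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(end - start).count();
    std::cout << "average FPS: " << iterations / seconds << std::endl;
    return 0;
}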
Is there a way to squeeze those missing 4 FPS out of the network? I am not a high-performance expert, so I may be blind to some inefficiencies in my code.
Thanks,
Yinon