Hi,
The overhead appears to come from the clockoverlay plugin. It requires CPU buffers, so each frame has to be copied from the NVMM buffer to a CPU buffer, and then copied back to an NVMM buffer for hardware encoding.
You may try using nvivafilter instead and calling the cairo APIs from it. Please refer to
Tx2-4g r32.3.1 nvivafilter performance - #16 by DaneLLL
This would save the overhead of the memory copies.
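
For a rough idea of how much data the two extra copies move, here is a back-of-the-envelope sketch. The resolution, pixel format and frame rate are only assumptions for illustration (they are not taken from your pipeline), and the function name is made up. It only counts bytes moved, so it does not predict the actual CPU load.

```python
def copy_overhead_bytes_per_second(width, height, bits_per_pixel, fps,
                                   copies_per_frame=2):
    """Bytes moved per second by the extra buffer copies.

    copies_per_frame=2 models one NVMM -> CPU copy plus one
    CPU -> NVMM copy for every frame.
    """
    frame_bytes = width * height * bits_per_pixel // 8
    return frame_bytes * copies_per_frame * fps


# Assumed example values: 1920x1080, NV12 (12 bits per pixel), 30 fps.
overhead = copy_overhead_bytes_per_second(1920, 1080, 12, 30)
print(f"{overhead / 1e6:.1f} MB/s")  # prints "186.6 MB/s"
```

Please replace the numbers with the ones from your own caps to see what applies to your case.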