cudaGraphicsUnmapResources performance overhead

Hi -

We currently use CUDA-OpenGL interop in our real-time graphics applications and have noticed a performance overhead in the cudaGraphicsUnmapResources runtime API call that appears to scale linearly with the number of resources.

For example, unmapping 100 graphics resources takes ~0.2 ms, 1000 takes ~2 ms, and 10000 takes ~20 ms.
We often have over 2000 buffers to map and unmap per frame, which at this rate costs about 4 ms. That is substantial for a real-time application.
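
The 4 ms figure is a simple extrapolation from the measurements above. As a rough sketch, assuming the cost is purely linear with no fixed per-call term (the names `Sample`, `perResourceUs`, and `estimateUnmapMs` are made up for illustration):

```cpp
#include <cstddef>

// One measured point: how many resources were unmapped and how long it took.
struct Sample {
    std::size_t count;  // number of graphics resources in the unmap call
    double ms;          // measured unmap time in milliseconds
};

// Per-resource cost in microseconds implied by a single sample.
constexpr double perResourceUs(Sample s) {
    return s.ms * 1000.0 / static_cast<double>(s.count);
}

// Extrapolated unmap time in milliseconds for n resources.
// Assumption: cost is strictly proportional to n (no constant overhead).
constexpr double estimateUnmapMs(Sample s, std::size_t n) {
    return s.ms / static_cast<double>(s.count) * static_cast<double>(n);
}
```

With the 1000-resource sample, `perResourceUs({1000, 2.0})` gives about 2 µs per resource, and `estimateUnmapMs({1000, 2.0}, 2000)` gives about 4 ms for our typical frame.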

We profiled with Nsight Systems and called cudaDeviceSynchronize() before the unmap call, to make sure the measured cost is not time spent waiting for device work to finish.
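
In case the methodology matters, here is a simplified host-side sketch of that kind of check. It is not our actual profiling setup, and `timeCallMs`, `drain`, and `call` are hypothetical names. The idea is only that the device is drained first, so pending GPU work is not billed to the call being timed:

```cpp
#include <chrono>

// Runs `drain`, then returns the host wall-clock time of `call` in milliseconds.
// In the real application, `drain` would wrap cudaDeviceSynchronize() and
// `call` would wrap cudaGraphicsUnmapResources(count, resources, stream).
template <typename Drain, typename Call>
double timeCallMs(Drain&& drain, Call&& call) {
    drain();  // finish pending work before the timer starts
    const auto t0 = std::chrono::steady_clock::now();
    call();   // the call under test
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Passing two lambdas that wrap the CUDA calls named in the comments would time only the host-side cost of the unmap itself.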

My questions are:

  1. What is cudaGraphicsUnmapResources doing under the hood that results in this overhead?
  2. Are there any approaches to reduce this overhead?