Profiling single-GPU, multi-session TensorFlow inference


TensorFlow Version: 1.15 (using the C++ API)
Driver Version: 440.33.01
CUDA Version: 10.2
GPU Device: NVIDIA GeForce RTX 2080 Ti

I’m attempting to run 3 TensorFlow sessions on the same GPU, each with its own copy of the computation graph. Having partitioned my single NVIDIA GeForce RTX 2080 Ti into 3 virtual devices (using Experimental::VirtualDevices), I launch 3 TensorFlow sessions, each from a different thread and with a different device target. With this setup, I have been able to map the operations from each session onto a different CUDA stream on the physical GPU.

Using Nsight Systems, I’m now profiling inference while running all 3 sessions at the same time, to stress-test GPU throughput.

After all 3 input tensors have been copied to the device for inference, CUDA streams 18, 14 and 22 are all active at the same time, but their kernels show little actual parallelism, even though the three workloads are data-parallel.

To me, this indicates that I might be compute-resource constrained. How can I verify this? How can I find out which resource is the bottleneck in this situation? Are there any other tools, such as Nsight Compute, that you would recommend for digging deeper into the issue?

Zooming in on the timeline where the active 3 cuda streams are executing concurrently:

Selecting an arbitrary kernel in the events view, say EigenKernel from the timeline above, I observe that the theoretical occupancy of this kernel is 100%:

Begins: 19.9468s
Ends: 19.9471s (+274.238 μs)
grid:  <<<68, 1, 1>>>
block: <<<1024, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 30
Local Memory Per Thread: 0 bytes
Local Memory Total: 82,444,288 bytes
Shared Memory executed: 32,768 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 100 %
Launched from thread: 17124
Latency: ←1.165 ms
Correlation ID: 200386
Stream: Stream 14

Does this indicate that the kernel is being mapped onto 100% of the CUDA threads available on the physical GPU? If so, am I compute constrained, and should I focus on reducing the runtime of my TF compute graph by using lighter operations? Or can I change how the kernel is mapped to CUDA threads so that it takes up fewer compute resources, to encourage greater parallelism between CUDA streams?
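To make my reasoning concrete, here is a rough sketch of the arithmetic as I understand it. The per-SM limits are my assumptions for a Turing GPU (compute capability 7.5: 1024 resident threads, 65,536 registers and 16 resident blocks per SM, with 68 SMs on the RTX 2080 Ti); they do not come from the profiler output. The struct and function names are made up for illustration. The model is simplified and ignores register allocation granularity and shared memory.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM limits for Turing (compute capability 7.5).
// These are my assumptions, not values reported by Nsight Systems.
struct SmLimits {
    int max_threads_per_sm = 1024;
    int max_blocks_per_sm = 16;
    int registers_per_sm = 65536;
    int sm_count = 68;  // RTX 2080 Ti
};

// Launch parameters as shown in the kernel properties above.
struct KernelLaunch {
    int grid_blocks;
    int threads_per_block;
    int registers_per_thread;
};

// How many blocks of this kernel can be resident on one SM at once.
// Simplified: ignores register allocation granularity and shared memory.
int resident_blocks_per_sm(const SmLimits& sm, const KernelLaunch& k) {
    int by_threads = sm.max_threads_per_sm / k.threads_per_block;
    int by_registers =
        sm.registers_per_sm / (k.registers_per_thread * k.threads_per_block);
    return std::min({by_threads, by_registers, sm.max_blocks_per_sm});
}

// Theoretical occupancy: resident threads per SM over the per-SM maximum.
double theoretical_occupancy(const SmLimits& sm, const KernelLaunch& k) {
    int resident_threads = resident_blocks_per_sm(sm, k) * k.threads_per_block;
    return static_cast<double>(resident_threads) / sm.max_threads_per_sm;
}

// Fraction of the whole GPU's thread slots that a single launch can fill,
// taking the grid size into account (occupancy alone is a per-SM figure).
double gpu_fill_fraction(const SmLimits& sm, const KernelLaunch& k) {
    long capacity_blocks =
        static_cast<long>(resident_blocks_per_sm(sm, k)) * sm.sm_count;
    long resident_blocks = std::min<long>(k.grid_blocks, capacity_blocks);
    double total_slots =
        static_cast<double>(sm.sm_count) * sm.max_threads_per_sm;
    return resident_blocks * k.threads_per_block / total_slots;
}
```

With the EigenKernel launch above (`KernelLaunch{68, 1024, 30}`), this model gives 1 resident block per SM, because the 1024-thread block hits the per-SM thread limit before the register limit. That matches the 100% theoretical occupancy, and the 68 blocks cover all 68 SMs. If these assumptions hold, a single launch of this kernel can fill every thread slot on the GPU by itself, which would leave no room for kernels from the other two streams until its blocks retire.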

My goal here is to reduce the inference time per session.

Any insight / suggestions are greatly appreciated!