TensorFlow version: 1.15 (using the C++ API)
Driver version: 440.33.01
CUDA version: 10.2
GPU: NVIDIA GeForce RTX 2080 Ti
I’m attempting to run 3 TensorFlow sessions, each with its own copy of the TF computation graph, on the same GPU. Having partitioned my single NVIDIA GeForce RTX 2080 Ti into 3 virtual devices (using GPUOptions::Experimental::VirtualDevices from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto), I launch 3 TensorFlow sessions, each from a different thread and each targeting a different virtual device. With this setup, I have been able to map the operations from each session onto a different CUDA stream on the physical GPU.
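For reference, the partitioning was done along these lines, expressed in ConfigProto text format (the memory_limit_mb values below are illustrative placeholders, not the ones I actually used):

```
# Sketch: split one physical GPU into three virtual devices.
# One VirtualDevices entry per physical GPU; each repeated
# memory_limit_mb creates one virtual device of that size.
gpu_options {
  experimental {
    virtual_devices {
      memory_limit_mb: 3000
      memory_limit_mb: 3000
      memory_limit_mb: 3000
    }
  }
}
```

Each session then sets its device target to one of the resulting /device:GPU:0, /device:GPU:1, /device:GPU:2 virtual devices.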
I’m now using Nsight Systems to profile session inference while running inference in all 3 sessions simultaneously, to stress-test GPU throughput.
After all 3 input tensors have been copied to the device for inference, CUDA streams 14, 18, and 22 execute operations concurrently, but with very little actual overlap, even though the three sessions are fully data-parallel.
To me, this suggests that I might be compute-resource constrained. How can I verify this? How can I identify which resource is the bottleneck in this situation? Are there other tools, such as Nsight Compute, that would be recommended for digging deeper into the issue?
Zooming in on the timeline where the 3 active CUDA streams are executing concurrently:
Selecting a random kernel in the events view, say the EigenMetaKernel from the timeline above, I observe that its theoretical occupancy is 100%:
EigenMetaKernel
Begins: 19.9468s
Ends: 19.9471s (+274.238 μs)
grid: <<<68, 1, 1>>>
block: <<<1024, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 0 bytes
Registers Per Thread: 30
Local Memory Per Thread: 0 bytes
Local Memory Total: 82,444,288 bytes
Shared Memory executed: 32,768 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 100 %
Launched from thread: 17124
Latency: ←1.165 ms
Correlation ID: 200386
Stream: Stream 14
Does this indicate that the kernel is being mapped onto 100% of the CUDA threads available on the physical GPU? If so, am I compute constrained, and should I focus on optimizing the runtime of my TF compute graph by using lighter operations? Or can I alter the kernel-to-CUDA-thread mapping so each kernel takes up fewer compute resources, encouraging greater parallelism between CUDA streams?
My goal here is to reduce the inference time per session.
Any insight / suggestions are greatly appreciated!