Detailed CUDA Implementation of TensorFlow

We are currently investigating how to deploy TensorFlow 2 with custom OPs for deep learning on our product. Our current understanding is that a session is where a TensorFlow graph is executed, and that the graph may include both built-in deep learning OPs and self-defined (custom) OPs. To choose the best software architecture for our product, we would like to know how a TensorFlow session is implemented on top of CUDA. Specifically, we would like to understand stream management, memory copies, and compute execution within a session. Going deeper, we would also like to understand the implementation at the OS level, e.g., running different sessions in a single process versus in multiple processes. Is there a particular document that addresses these questions, either from TF2 or from NVIDIA?
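As background for the multi-process part of the question: by default a TensorFlow process reserves nearly all GPU memory at startup, which is the first obstacle to running several processes on one GPU. A minimal sketch (assuming TF 2.x and its `tf.config` API) of making allocations grow on demand instead:

```python
import tensorflow as tf

# Sketch, assuming TF 2.x: control per-process GPU memory, the main knob
# when several processes must share a single GPU.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    # Allocate GPU memory on demand rather than reserving it all up front,
    # so multiple TF processes can coexist on the same device.
    tf.config.experimental.set_memory_growth(gpu, True)

# The loop is a no-op on a machine with no visible GPU.
print("visible GPUs:", len(gpus))
```

Note this only governs memory sharing between processes; kernel launches from different processes are still time-sliced by the driver unless something like NVIDIA MPS is used.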

NVIDIA doesn’t directly maintain or support TensorFlow. You might want to ask your questions about it on the TF forums.