I was researching how many threads per block and blocks per grid are used in actual training. I tried reading the TensorFlow core sources and plan to read the TensorRT white papers, but I wasn't sure if I was on the right track.
Is there a way I could find at least a hint?
The optimal launch configuration depends on the specifics of the compute kernel, so there is no single value. You'll need to identify a particular operation of interest to investigate. As for the sources, some TF native kernels use functions from tensorflow/core/util/gpu_launch_config.h to determine their launch configuration. For XLA, you might start in tensorflow/compiler/xla/service/gpu/partition_assignment.h.
Another approach is to run your network through the nvprof or Nsight Systems profilers. In the timeline view, the kernel properties will show the actual launch configuration used.
Thanks a lot. I'll dig into those :)
Is there a method to find these numbers (thread blocks and threads per block) for a model compiled with TensorRT?
If you profile the application with Nsight Compute, you can get the CUDA block and grid launch configurations and much more. https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/
Is there a way to change these preset values?