I was researching how many threads per block and blocks per grid are used in actual training. I tried reading the TensorFlow core sources and plan to read the TensorRT white papers, but I wasn't sure if I was on the right track.
Is there a way I could find at least a hint?
The optimal launch configuration depends on the specifics of the compute kernel, so there is no single value. You'll need to identify a particular operation of interest to investigate. As for the sources, some TF native kernels use functions from tensorflow/core/util/gpu_launch_config.h to determine their launch configuration. For XLA, you might start in tensorflow/compiler/xla/service/gpu/partition_assignment.h.
Another approach is to run your network through the nvprof or Nsight Systems profilers. In the timeline view, the kernel properties will show the actual launch configuration used.
Thanks a lot. I'll dig into those :)
Is there a method to find these numbers (thread blocks and threads per block) for a model compiled with TensorRT?
If you profile the application with Nsight Compute, you can get the CUDA block and grid launch configurations and much more. https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/
Is there a way to change these preset values?