How many thread blocks and threads per block are assigned in tensorflow-gpu?

I was researching how many threads per block and blocks per grid are used in actual training. I tried reading the TensorFlow core sources and was planning to read the TensorRT white papers, but I was not sure whether I was on the right track.
Is there a way I could find at least a hint?


The optimal launch configuration depends on the specifics of each compute kernel, so there is no single value. You’ll need to identify a particular operation of interest to investigate. As for the sources, some TF native kernels use helpers from tensorflow/core/util/gpu_launch_config.h when determining their launch configuration. For XLA, you might start in tensorflow/compiler/xla/service/gpu/partition_assignment.h.
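To make the idea concrete, here is a minimal Python sketch of the common 1-D launch heuristic that helpers like those in gpu_launch_config.h embody: cap threads per block at the hardware limit, then use enough blocks to cover every element. This is an illustration of the general pattern, not the exact TensorFlow implementation; the function name and the 1024-thread cap are assumptions (1024 is the per-block limit on most recent NVIDIA GPUs).

```python
import math

def launch_config_1d(work_count, max_threads_per_block=1024):
    """Sketch of a typical 1-D launch heuristic (not TF's exact code):
    cap threads per block, then cover all elements with enough blocks."""
    if work_count <= 0:
        return 0, 0
    threads = min(work_count, max_threads_per_block)
    blocks = math.ceil(work_count / threads)
    return blocks, threads

# e.g. an element-wise op over 1,000,000 floats:
print(launch_config_1d(1_000_000))  # → (977, 1024)
```

A real kernel may instead pick its block count from occupancy (multiprocessor count × blocks per SM), which is why profiling the actual run, as suggested below, is the reliable way to see the numbers used.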

Another approach is to run your network under the nvprof or Nsight Systems profilers. In the timeline view, the kernel properties show the actual launch configuration used.
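As a sketch of that workflow, the snippet below builds an Nsight Systems invocation for a training script. The script name `train.py` and the report name are placeholders for your own entry point; `nsys profile -o <report>` is the standard CLI form, and the resulting report can be opened in the Nsight Systems GUI, where each CUDA kernel's properties list its grid and block dimensions.

```python
import subprocess

# Hypothetical training script -- substitute your own entry point.
SCRIPT = "train.py"

def nsys_command(script, report_name="tf_report"):
    """Build an `nsys profile` command that records the CUDA timeline.
    Open the resulting report in the Nsight Systems GUI and select a
    kernel to see its grid/block launch configuration."""
    return ["nsys", "profile", "-o", report_name, "python", script]

cmd = nsys_command(SCRIPT)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment on a machine with Nsight installed
```

With the older nvprof, the equivalent is simply `nvprof python train.py`, which prints per-kernel grid and block sizes in its summary.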

Thanks a lot. I'll dig into those :)


Is there a method to find these numbers (thread blocks and threads per block) for a model compiled with TensorRT?

If you profile the application with Nsight Compute, you can see the CUDA block and grid launch configurations, and much more.

Is there a way to change the pre-set values?