What launch configuration (blocks per grid & threads per block) is used when cuBLAS or cuDNN APIs are called?

Hi:

Are there any references describing what launch configuration (blocks per grid & threads per block) is used when cuBLAS or cuDNN APIs are called?

For example:
when cublasGemmEx is called, what (blocks per grid) & (threads per block) will be set?

Moreover, if:

  1. multiple streams are created and each of them calls, e.g., cublasGemmEx()

  2. multiple threads are launched and each of them calls, e.g., cublasGemmEx()

  3. multiple processes are launched and each of them calls, e.g., cublasGemmEx()

For each case above, is there any difference in the computational resource allocation policy?
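For context, case 1 might look like the following minimal sketch. The function name `gemm_on_streams` is hypothetical, and it assumes a valid cuBLAS handle plus per-stream device buffers have already been allocated (error checking omitted for brevity):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical sketch of case 1: several streams, each issuing its own GEMM.
// Assumes d_A[i], d_B[i], d_C[i] are valid device buffers for stream i.
void gemm_on_streams(cublasHandle_t handle,
                     int n_streams, int m, int n, int k,
                     const float **d_A, const float **d_B, float **d_C) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaStream_t streams[8];  // assume n_streams <= 8 for this sketch
    for (int i = 0; i < n_streams; ++i)
        cudaStreamCreate(&streams[i]);
    for (int i = 0; i < n_streams; ++i) {
        // A cuBLAS handle issues work on whichever stream was set last,
        // so the stream is switched before each GemmEx call.
        cublasSetStream(handle, streams[i]);
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha,
                     d_A[i], CUDA_R_32F, m,
                     d_B[i], CUDA_R_32F, k,
                     &beta,
                     d_C[i], CUDA_R_32F, m,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    }
    for (int i = 0; i < n_streams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

Whatever grid/block dimensions cuBLAS picks internally for each of these calls is exactly what the question is asking about.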

Thanks~

To some degree, these settings will vary with problem size.

There is no documentation, and anything you discover may change from one CUDA version to the next, or even when running on a different GPU type.

You can discover the blocks and threads for any kernel call using a profiler.
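For instance, NVIDIA's profilers report the grid and block dimensions of every kernel a library call launches, including kernels launched internally by cuBLAS/cuDNN. A sketch of the invocations (`./my_app` is a placeholder for your binary):

```shell
# Nsight Compute: print launch statistics (grid size, block size,
# registers per thread, ...) for each kernel in the application.
ncu --section LaunchStats ./my_app

# Nsight Systems: per-kernel summary tables, which include
# grid and block dimensions for each kernel instance.
nsys profile --stats=true ./my_app
```

Filtering by kernel name (e.g. with `ncu --kernel-name`) narrows the output to the GEMM kernels you care about.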