How to determine the max number of blocks and threads for a GPU?

Is there a way to determine the max number of grids, blocks per grid, and threads per block, on a given GPU?

For instance, the Tesla C2075 board, as specified here:
https://www.nvidia.com/docs/IO/43395/NV-DS-Tesla-C2075.pdf
has 448 cores.

What does this mean for the maximum number of blocks n and threads m that I can use when launching the kernel with

my_function<<<n,m>>>()

You got a few answers for the same question yesterday:
https://devtalk.nvidia.com/default/topic/1045204/cuda-programming-and-performance/where-do-i-check-the-number-of-blocks-and-threads-that-are-available-on-tesla-c2050-/

Thanks
I found the information in table 14

But I still don’t understand how many blocks I can create.

My GPU is Tesla C2050/2075 and when I run the command deviceQuery I see:

Device 0: “Tesla C2075”
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5301 MBytes (5558501376 bytes)
(14) Multiprocessors, ( 32) CUDA Cores/MP: 448 CUDA Cores
GPU Max Clock rate: 1147 MHz (1.15 GHz)
Memory Clock rate: 1566 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla C2075
Result = PASS

The table 14 that you mentioned specifies that for CUDA capability 2.0 there are

Maximum number of threads per block = 1024
Maximum number of resident blocks per multiprocessor = 8

Does this mean that the largest number of blocks and threads that can run my function in parallel is

my_function<<<8,1024>>>()
 ?

No, it does not. A GPU kernel launch can consist of many more blocks than can be resident on a multiprocessor at any one time.

The most immediate limits are these:

Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
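These limits (the same numbers deviceQuery prints) can also be read programmatically through the CUDA runtime API; a minimal sketch, compiled with nvcc:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims (x,y,z): (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims (x,y,z):  (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Multiprocessors:        %d\n", prop.multiProcessorCount);
    return 0;
}
```

On a Tesla C2075 this would report the same values as the deviceQuery output above.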

The kernel launch configuration <<<…>>> specifies two dim3 quantities: the first is the number of blocks in the grid, and the second is the number of threads in each block.

These are 3-dimensional quantities. Each dimension must satisfy the respective limit. Furthermore, the total number of threads in the block (i.e. the product of the 3 dimensions of the thread block size) must be at most 1024.

Anything within those limits is possible.
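For example, any of the following launch configurations would be legal on this device (a sketch; my_kernel stands in for your own kernel):

```cuda
dim3 grid(256, 128);    // 2D grid: 256 x 128 = 32768 blocks, each dim <= 65535
dim3 block(32, 8, 4);   // 3D block: 32 * 8 * 4 = 1024 threads, <= 1024 total
my_kernel<<<grid, block>>>();   // legal

my_kernel<<<65535, 1024>>>();   // also legal: 1D grid at the per-dimension limit

// my_kernel<<<1, 2048>>>();    // ILLEGAL: 2048 > 1024 threads per block
```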

No. The number of blocks in a kernel launch is limited by the maximum dimensions of the grid. The first parameter between the triple angular brackets is the grid configuration (1D, 2D, or 3D): blocks in the grid. The second parameter is the block configuration (1D, 2D, or 3D): threads in each block.

So in this case, you can have a 1D grid of up to 65535 blocks, a 2D grid of up to 65535 x 65535 blocks, or a 3D grid of up to 65535 x 65535 x 65535 blocks.

In typical CUDA programs the number of blocks in a grid is significantly larger than the number of blocks that can execute simultaneously at any given time, which is max_resident_blocks_per_multiprocessor x number_of_multiprocessors (in this case: 8 x 14 = 112 blocks).

[I see I should have F5-ed, as Robert Crovella already supplied an answer …]