I am using a Tesla C1060 with an AMD Phenom 9850 quad-core CPU.
On page 82 of the NVIDIA_CUDA_Programming_Guide_2.1, it is stated for devices of compute capability 1.0 that the maximum
size of each dimension of a grid of thread blocks is 65535. For devices of compute capability 1.3, nothing is mentioned about the
maximum size of each dimension, so I 'assume' that the 1.0 restriction still applies.
I have launched a grid with threads_per_block = 512 and number_of_blocks = ceil((1024 * 1024 * 10)/512)
Kernel launch: kernel_name <<<number_of_blocks,threads_per_block>>> (argument_list)
The kernel launch succeeds and the computations run. Total_threads_created = 1024 * 1024 * 10.
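
To make the numbers in this launch concrete, here is a minimal host-side sketch of the arithmetic. The function and constant names are my own, chosen only for illustration. The 65535 figure is the per-dimension grid limit the guide gives for compute capability 1.0, which I am only assuming also applies to 1.3.

```cpp
#include <cassert>
#include <cstdio>

// Per-dimension grid limit quoted from the Programming Guide for compute
// capability 1.0. ASSUMPTION: the same value applies to 1.3 devices.
const unsigned long long kMaxGridDim = 65535ULL;

// Integer ceiling division: number of blocks needed to cover total_threads.
unsigned long long blocks_needed(unsigned long long total_threads,
                                 unsigned long long threads_per_block) {
    assert(threads_per_block > 0);
    return (total_threads + threads_per_block - 1) / threads_per_block;
}

// True if a 1D grid with this many blocks stays within the assumed limit.
bool fits_in_one_grid_dimension(unsigned long long number_of_blocks) {
    return number_of_blocks <= kMaxGridDim;
}

// Prints the figures for the launch configuration described in this post.
void report_launch_configuration() {
    const unsigned long long total_threads = 1024ULL * 1024ULL * 10ULL;
    const unsigned long long threads_per_block = 512ULL;
    const unsigned long long number_of_blocks =
        blocks_needed(total_threads, threads_per_block);

    std::printf("total threads     : %llu\n", total_threads);
    std::printf("threads per block : %llu\n", threads_per_block);
    std::printf("number of blocks  : %llu\n", number_of_blocks);
    std::printf("within grid limit : %s\n",
                fits_in_one_grid_dimension(number_of_blocks) ? "yes" : "no");
}
```

Calling report_launch_configuration() from main gives 10485760 total threads and 20480 blocks. The comparison in this sketch is between the block count and 65535, since the guide states the limit per grid dimension.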
How is this possible? Does the CUDA runtime check for this constraint and schedule the
blocks accordingly, i.e. will only as many blocks run at any given time as keep
total_number_of_running_threads <= 65535? If such a scheduling procedure exists, is it specific to
devices of compute capability 1.3 only?