To take advantage of the faster shared memory, my application determines a number of threads, M, that can execute in a block. Each thread uses its own portion of the shared memory, independent of the other threads.
The Tesla has 30 multiprocessors, each with 8 cores. I am expecting that I can have 30 x 8 x M = 240 x M threads working in parallel simultaneously on the device. When I attach the NSIGHT debugger and select a CUDA thread context, it shows the dimensions of my thread space as (M,1,1) and the dimensions of my block space as (240,1,1), but I cannot select a block index greater than 29 (0-29 = 30 blocks). This seems to indicate that I am only executing 30 x M threads on the device simultaneously. Am I wrong in thinking that I should be able to execute 240 blocks of M threads simultaneously?
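For reference, here is a minimal sketch of the kind of launch I am describing. The kernel name, M, and the per-thread shared-memory usage are placeholders, not my actual code:

```cuda
#include <cstdio>

// Each thread works in its own slot of dynamically sized shared memory.
__global__ void myKernel(float *out)
{
    extern __shared__ float scratch[];   // M floats, one slot per thread
    int tid = threadIdx.x;
    scratch[tid] = (float)tid;           // placeholder per-thread work
    __syncthreads();
    out[blockIdx.x * blockDim.x + tid] = scratch[tid];
}

int main()
{
    const int M = 128;          // threads per block (placeholder value)
    const int numBlocks = 240;  // 30 multiprocessors x 8 cores, as I understand it
    float *d_out;
    cudaMalloc(&d_out, numBlocks * M * sizeof(float));

    // Launch 240 blocks of M threads, with M floats of shared memory per block.
    myKernel<<<numBlocks, M, M * sizeof(float)>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```

My expectation was that all 240 blocks of this launch would be resident on the device at once, but the debugger only ever shows block indices 0 through 29.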
The relationship between blocks and hardware cores/multiprocessors is not clear to me. For example, on the Tesla, how many blocks can execute in parallel? What is the significance of cores versus multiprocessors, and of the product of the two?
Thanks very much in advance.