Multiprocessors, Cores, Threads and Parallelism

Hello Forum,

In order to leverage the faster shared memory, my application determines a number of threads, M, that can execute in a block. Each thread uses a portion of the shared memory independently of the other threads.

The Tesla has 30 multiprocessors, each with 8 cores. I am expecting that I can have 30 x 8 x M = 240 x M threads working in parallel simultaneously on the device. When I attach the NSIGHT debugger and select a CUDA thread context, it shows the dimensions of my thread space (M,1,1) and the dimensions of my block space (240,1,1), but I cannot select a block index greater than 29 (0-29 = 30 blocks). This seems to indicate that I am only executing 30 x M threads on the device simultaneously. Am I wrong in thinking that I should be able to execute 240 blocks of M threads simultaneously?

The relationship between blocks and hardware cores/multiprocessors is not clear to me. For example, on the Tesla, how many blocks can be executed in parallel? What is the significance of cores and multiprocessors, and of the product of the two?

Thanks very much in advance.

The threads of one block are executed on one multiprocessor, but multiple blocks can run on one multiprocessor if there are enough resources (registers, shared memory). The maximum number of blocks you can launch is 65535 * 65535 (x * y grid dimensions). If your blocks don't all fit on the multiprocessors at once, they are serialised (i.e. if 2 blocks fit on each multiprocessor, then first blocks 0-59 run, then 60-119, and so on). The cores of each multiprocessor run in lockstep (they perform the same operation at the same time).

The number of cores per multiprocessor effectively determines how many threads of each block can be worked on in the same cycle: on the C1060, 8 threads per block are physically serviced each cycle, while on the C2050, 32 threads are active each cycle.

Hope that helps
Ceearem

P.S. I have simplified some things, but the basic principles should be OK

Thanks! Yes, this helps tremendously. This relationship is not at all clear in the CUDA books and documentation, but I did find a bit in the CUDA model overview slide presentation. Thanks again.
