This should be simple to answer for somebody who knows CUDA well. I have a Tesla C1060. How many threads can it run concurrently?
More specifically, say I have blocks of 16 x 16 = 256 threads. With a limit of 1024 resident threads per multiprocessor, 4 such blocks fit on each multiprocessor. Does every thread in these 4 blocks run in parallel (or close to it)? If so, then with 30 multiprocessors the total is 30 x 1024 = 30720 threads. Is this number the answer to my first question? Also, say there are 1 million threads in a grid. At what level does this code run sequentially? That is, do 30 x 4 = 120 blocks run in parallel at a time, and when those finish, does the device move on to the next 120 blocks?
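The arithmetic behind these numbers can be written out as a small host-side sketch. The device limits below (30 multiprocessors, 1024 resident threads and 8 resident blocks per multiprocessor) are hard-coded assumptions for a compute capability 1.3 card such as the Tesla C1060. The function names are hypothetical, and register and shared-memory usage, which can lower the resident block count further, is ignored. The batch count only reflects the simplified "120 blocks at a time" model from the question, not a statement of how the hardware actually schedules blocks.

```cpp
#include <algorithm>
#include <cassert>

// Assumed limits for a compute capability 1.3 device (e.g. Tesla C1060).
// Hard-coded for illustration only; real code should query the device.
const int kNumSMs = 30;             // multiprocessors on the device
const int kMaxThreadsPerSM = 1024;  // resident threads per multiprocessor
const int kMaxBlocksPerSM = 8;      // resident blocks per multiprocessor

// Blocks that can be resident on one multiprocessor, limited only by the
// thread and block caps (register/shared-memory limits are ignored here).
int residentBlocksPerSM(int threadsPerBlock) {
    assert(threadsPerBlock > 0 && threadsPerBlock <= kMaxThreadsPerSM);
    return std::min(kMaxThreadsPerSM / threadsPerBlock, kMaxBlocksPerSM);
}

// Threads that can be resident on the whole device at once.
int residentThreadsOnDevice(int threadsPerBlock) {
    return kNumSMs * residentBlocksPerSM(threadsPerBlock) * threadsPerBlock;
}

// Number of blocks needed to cover totalThreads (ceiling division).
long long gridBlocks(long long totalThreads, int threadsPerBlock) {
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of full-device batches under the question's simplified model,
// where the device would process one full set of resident blocks at a time.
long long fullBatches(long long numBlocks, int threadsPerBlock) {
    long long blocksPerBatch =
        static_cast<long long>(kNumSMs) * residentBlocksPerSM(threadsPerBlock);
    return (numBlocks + blocksPerBatch - 1) / blocksPerBatch;
}
```

With 256 threads per block, `residentBlocksPerSM(256)` gives 4 and `residentThreadsOnDevice(256)` gives 30720, matching the figures above. A grid of 1 million threads needs `gridBlocks(1000000, 256)` = 3907 blocks, which under the simplified batch model would be 33 batches of up to 120 blocks.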