Partitioning

Hi there,

I am working on a CFD case on a Tesla C1060.

My whole problem is unstructured, and I am using METIS for domain decomposition.

I have a few questions regarding warps and the maximum number of active threads.

As I understand how CUDA works, I have always assumed that I have 30 SMs with a maximum of 8 SPs each, which would give me a total of 240 blocks that I can use concurrently. I further understood that 3 SMs share one thread block scheduler and that each SM has its own warp scheduler.

My question now is: what does "max number of active threads per SM" (apparently 1024) mean? I always assumed that 240 active blocks means that 240 warps (or 240 * 32 = 7680 threads) can be executed concurrently.

But now I have seen that, because there are 30 SMs, 30 blocks can be executed concurrently (does that mean a maximum of 30 SMs * 32 threads = 960 threads concurrently?). And because of the maximum number of active threads, 30 * 1024 = 30720 threads can run concurrently. How should I divide up my problem to get the optimal number of blocks and warps?
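To make the three figures I am comparing explicit, here is a small Python sketch of the arithmetic. The constants are just the values quoted above (30 SMs, 8 SPs per SM, 32 threads per warp, 1024 active threads per SM) and the variable names are mine. Which of the three results actually describes concurrency on the hardware is exactly what I am unsure about.

```python
# Values quoted in this post for the Tesla C1060.
# They are assumptions typed in by hand, not queried from the device.
NUM_SMS = 30
SPS_PER_SM = 8
WARP_SIZE = 32
MAX_ACTIVE_THREADS_PER_SM = 1024

# Reading 1: one warp per SP, i.e. 240 warps in flight.
threads_one_warp_per_sp = NUM_SMS * SPS_PER_SM * WARP_SIZE

# Reading 2: one block of a single warp per SM, i.e. 30 warps in flight.
threads_one_warp_per_sm = NUM_SMS * WARP_SIZE

# Reading 3: every SM filled up to its active-thread limit.
threads_active_limit = NUM_SMS * MAX_ACTIVE_THREADS_PER_SM

print("one warp per SP:       ", threads_one_warp_per_sp)
print("one warp per SM:       ", threads_one_warp_per_sm)
print("active-thread limit:   ", threads_active_limit)
```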

So, for example, which approach would in theory be faster: (1) make sure all 240 blocks are in use, say with 32 threads (1 warp) each, which should each be processed in 4 cycles, all concurrently, or (2) use 120 blocks with 64 threads (2 warps) each?
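The two options can be put side by side in a short Python sketch. It assumes the blocks are spread evenly over the 30 SMs and that an SM with 8 SPs needs 32 / 8 = 4 cycles to issue one instruction for a full warp. Both are my assumptions, and the helper name `describe` is made up for illustration.

```python
# Values quoted in this post (assumptions, not queried from the device).
NUM_SMS = 30
SPS_PER_SM = 8
WARP_SIZE = 32
MAX_ACTIVE_THREADS_PER_SM = 1024

# Assumed: one thread per SP per cycle, so a 32-thread warp takes 4 cycles.
CYCLES_PER_WARP_INSTRUCTION = WARP_SIZE // SPS_PER_SM


def describe(num_blocks, threads_per_block):
    """Summarise a launch configuration, assuming blocks are spread evenly over the SMs."""
    blocks_per_sm = num_blocks // NUM_SMS
    threads_per_sm = blocks_per_sm * threads_per_block
    return {
        "total_threads": num_blocks * threads_per_block,
        "warps_per_block": threads_per_block // WARP_SIZE,
        "blocks_per_sm": blocks_per_sm,
        "threads_per_sm": threads_per_sm,
        "fraction_of_active_limit": threads_per_sm / MAX_ACTIVE_THREADS_PER_SM,
    }


option_1 = describe(240, 32)   # 240 blocks of 1 warp each
option_2 = describe(120, 64)   # 120 blocks of 2 warps each

print("cycles per warp instruction:", CYCLES_PER_WARP_INSTRUCTION)
print("option 1:", option_1)
print("option 2:", option_2)
```

Under these assumptions both options launch the same 7680 threads and place 256 threads on each SM, a quarter of the 1024 limit, so the arithmetic alone does not tell me which one is faster.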

Any help is very much appreciated.

cheers

Markus