Best strategy for different degrees of parallelism

Hi, I already have a little experience with CUDA…

However, I would also like to know your opinion on the best strategy for implementing a customized RSA algorithm:

I need to execute a very complex calculation (Montgomery multiplication) thousands of times (cycles). The problem is that in each of these cycles I mostly need just 33 threads, and only a couple of times 33*32 threads (for some even more complex calculations)… but of course those few times can become a bottleneck…

What would you do?

  • running 1056 threads (33*32) in parallel, and keeping most of them idle when I need just 33 of them

  • launching a kernel with a different number of threads each time, e.g. doing most of the work in a kernel with 33 threads, and then calling another kernel with 33*32 threads

Thanks in advance

There is some margin between the minimum number of threads per block needed to fully load an SM and the maximum number of threads possible. If you sacrifice some of the possible parallelism (e.g. use 33*3 threads looping over the work instead of 33*32 threads), you might find a good compromise. Just make sure that the threads that are idle most of the time fill complete warps, so that they don't consume any cycles while waiting in __syncthreads().
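A minimal sketch of that compromise, under some assumptions of mine: one block of 128 threads (4 full warps, an arbitrary choice), a placeholder increment standing in for the real Montgomery steps, and hypothetical names (`mixedWidthKernel`, `narrowBuf`, `wideBuf`). In the narrow phase only the first 33 threads work, so warps 2 and 3 are entirely idle and just wait at the barrier; in the wide phase all 128 threads loop over the 33*32 work items.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NARROW = 33;       // threads needed in most steps (from the question)
constexpr int WIDE   = 33 * 32;  // work items in the occasional wide step
constexpr int BLOCK  = 128;      // assumed block size: 4 complete warps

// Placeholder arithmetic only; the real per-step work is not shown here.
__global__ void mixedWidthKernel(unsigned int *narrowBuf,
                                 unsigned int *wideBuf, int cycles)
{
    const int tid = threadIdx.x;
    for (int c = 0; c < cycles; ++c) {
        // Narrow phase: only the first 33 threads do any work.
        if (tid < NARROW)
            narrowBuf[tid] += 1u;
        __syncthreads();  // reached by every thread, outside the branch

        // Wide phase: each thread strides over the 1056 work items.
        for (int i = tid; i < WIDE; i += blockDim.x)
            wideBuf[i] += narrowBuf[i % NARROW];
        __syncthreads();  // finish all reads before the next narrow phase
    }
}

int main()
{
    unsigned int *narrowBuf = nullptr, *wideBuf = nullptr;
    cudaMalloc((void **)&narrowBuf, NARROW * sizeof(unsigned int));
    cudaMalloc((void **)&wideBuf, WIDE * sizeof(unsigned int));
    cudaMemset(narrowBuf, 0, NARROW * sizeof(unsigned int));
    cudaMemset(wideBuf, 0, WIDE * sizeof(unsigned int));

    const int cycles = 4;
    mixedWidthKernel<<<1, BLOCK>>>(narrowBuf, wideBuf, cycles);
    cudaDeviceSynchronize();

    unsigned int lastNarrow = 0, lastWide = 0;
    cudaMemcpy(&lastNarrow, narrowBuf + NARROW - 1, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastWide, wideBuf + WIDE - 1, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    // With cycles = 4 the placeholder math gives 4 and 1+2+3+4 = 10.
    printf("narrow=%u wide=%u\n", lastNarrow, lastWide);

    cudaFree(narrowBuf);
    cudaFree(wideBuf);
    return 0;
}
```

Everything stays in one launch, so there is no kernel-launch overhead between the narrow and wide steps. Whether 128 threads is the right trade-off for the real workload is something only profiling can tell.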

Of course, 33 threads is a very unfortunate number, as it requires another 31 idle threads to round the thread count up to a multiple of the warp size (32 threads). It might be worth considering whether you can somehow recast the problem to fit a better block size.
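To make the rounding concrete, here is a small host-side sketch. The helper name `roundUpToWarp` is hypothetical, and the warp size is hard-coded to 32 as stated above.

```cuda
#include <cstdio>

constexpr int WARP_SIZE = 32;  // warp size as given in the post above

// Round a thread count up to the next multiple of the warp size.
constexpr int roundUpToWarp(int n)
{
    return (n + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE;
}

int main()
{
    const int counts[] = {33, 33 * 32};
    for (int n : counts) {
        const int padded = roundUpToWarp(n);
        printf("%d threads -> %d scheduled, %d idle\n", n, padded, padded - n);
    }
    return 0;
}
```

For 33 threads this gives 64 scheduled with 31 idle. 33*32 = 1056 is already a multiple of 32, so the wide step wastes nothing.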