Hi, I already have a little experience with CUDA.
However, I would also like your opinion on the best strategy for implementing a customized RSA algorithm:
I need to execute a very complex calculation (Montgomery multiplication) thousands of times (cycles). The problem is that in most of these cycles I need just 33 threads, and only a couple of times 33*32 threads (for some even more complex calculations), but of course those few times can become a bottleneck.
What would you do?

Run 1056 threads (33*32) in parallel, and keep most of them idle whenever I need just 33 of them.

Launch a separate kernel each time with a different number of threads, e.g. do most of the work in a kernel with 33 threads, and then call another kernel with 33*32 threads for the heavy step.
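
To make the two options concrete, here is a rough sketch of what I have in mind. Everything in it is made up for illustration: the kernel names, the cycle count, the "one heavy step every 100 cycles" schedule, and the placeholder arithmetic, which is not the real Montgomery code. One thing I noticed while writing it: 1056 threads is more than the 1024-threads-per-block limit of current GPUs, so for the first option the sketch uses a single block of 1024 threads with a stride loop, and for the second option it uses 33 blocks of 32 threads.

```cuda
#include <cstring>
#include <cuda_runtime.h>

constexpr int LIGHT = 33;         // threads needed in most cycles
constexpr int HEAVY = 33 * 32;    // 1056 work items in the occasional heavy step
constexpr int CYCLES = 1000;      // hypothetical number of cycles
constexpr int HEAVY_EVERY = 100;  // hypothetical: one heavy step per 100 cycles

// Placeholder work items (assumption: stand-ins for the real calculations).
__device__ void light_item(unsigned *limb, int i) { limb[i] = limb[i] * 3u + 1u; }
__device__ void heavy_item(const unsigned *limb, unsigned *wide, int i) {
    wide[i] = limb[i / 32] + (unsigned)(i % 32);
}

// Option 1: one kernel, one block; threads >= 33 sit idle in the light step.
__global__ void fused_kernel(unsigned *limb, unsigned *wide) {
    for (int c = 0; c < CYCLES; ++c) {
        if (threadIdx.x < LIGHT) light_item(limb, threadIdx.x);
        __syncthreads();  // the heavy step reads limbs written by other threads
        if (c % HEAVY_EVERY == 0) {  // uniform condition, so the barrier inside is safe
            for (int i = threadIdx.x; i < HEAVY; i += blockDim.x)
                heavy_item(limb, wide, i);
            __syncthreads();  // finish reading limbs before the next light step
        }
    }
}

// Option 2: separate kernels, each launched with exactly the threads it needs.
__global__ void light_kernel(unsigned *limb) { light_item(limb, threadIdx.x); }
__global__ void heavy_kernel(const unsigned *limb, unsigned *wide) {
    heavy_item(limb, wide, blockIdx.x * blockDim.x + threadIdx.x);
}

// Runs one of the two options and copies the results back to the host.
static bool run(bool fused, unsigned *limb_out, unsigned *wide_out) {
    unsigned h_limb[LIGHT];
    for (int i = 0; i < LIGHT; ++i) h_limb[i] = (unsigned)i;

    unsigned *d_limb = nullptr, *d_wide = nullptr;
    cudaMalloc(&d_limb, LIGHT * sizeof(unsigned));
    cudaMalloc(&d_wide, HEAVY * sizeof(unsigned));
    cudaMemcpy(d_limb, h_limb, LIGHT * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemset(d_wide, 0, HEAVY * sizeof(unsigned));

    if (fused) {
        fused_kernel<<<1, 1024>>>(d_limb, d_wide);
    } else {
        // Launches on the same stream execute in order, so no explicit
        // synchronization is needed between the light and heavy kernels.
        for (int c = 0; c < CYCLES; ++c) {
            light_kernel<<<1, LIGHT>>>(d_limb);
            if (c % HEAVY_EVERY == 0) heavy_kernel<<<LIGHT, 32>>>(d_limb, d_wide);
        }
    }

    // These blocking copies wait for the preceding kernels to finish.
    cudaError_t e1 = cudaMemcpy(limb_out, d_limb, LIGHT * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(wide_out, d_wide, HEAVY * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_limb);
    cudaFree(d_wide);
    return e1 == cudaSuccess && e2 == cudaSuccess;
}

// Returns 0 if both options ran and produced identical results, 1 otherwise.
int compare_strategies() {
    unsigned la[LIGHT], lb[LIGHT], wa[HEAVY], wb[HEAVY];
    if (!run(true, la, wa) || !run(false, lb, wb)) return 1;
    bool same = std::memcmp(la, lb, sizeof(la)) == 0 &&
                std::memcmp(wa, wb, sizeof(wa)) == 0;
    return same ? 0 : 1;
}
```

Calling `compare_strategies()` from `main` only checks that the two variants agree. To see which one is actually faster I would still have to time both (for example with CUDA events), since the first option pays for idle threads and barriers while the second pays a launch overhead on every cycle.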
Thanks in advance