Thread specialization optimization

Hello, I have a bunch of small functions that should each be executed by just a single lane in a thread block. I know this is not ideal, as it will lead to warp divergence, but it needs to be done.
Now, in order to be sure that each one runs on a single thread, I check the thread index, so for example:

If thread 1: functionA
If thread 2: functionB
If thread 3: functionC
And so on, for something like 50 functions.
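A minimal sketch of that pattern (the device functions and kernel shape are placeholders standing in for the real ~50 functions, not actual code from the project):

```cuda
// Placeholder device functions; the real ones are small and fast.
__device__ float functionA(const float* d) { return d[0] * 2.0f; }
__device__ float functionB(const float* d) { return d[1] + 1.0f; }
__device__ float functionC(const float* d) { return d[2] - 3.0f; }

__global__ void specializedKernel(const float* data, float* results)
{
    // Each listed function is executed by exactly one thread of the block.
    if (threadIdx.x == 0) results[0] = functionA(data);
    if (threadIdx.x == 1) results[1] = functionB(data);
    if (threadIdx.x == 2) results[2] = functionC(data);
    // ... and so on, up to ~50 functions
}
```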

When possible, I also try to spread those tiny functions across separate warps, depending on the number of functions and the number of warps.
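One common way to express that warp-level variant is to branch on the warp index rather than the lane index, so divergence happens at warp granularity instead of inside a warp. A hedged sketch (the function names are placeholders):

```cuda
// Placeholder device functions standing in for the real tiny functions.
__device__ float functionA(const float* d) { return d[0] * 2.0f; }
__device__ float functionB(const float* d) { return d[1] + 1.0f; }

__global__ void warpSpecializedKernel(const float* data, float* results)
{
    const int warpId = threadIdx.x / 32;  // which warp within the block
    const int laneId = threadIdx.x % 32;  // lane within the warp

    // Each function is still run by a single lane, but since each branch
    // condition is uniform per warp, different warps can each execute
    // their own function without intra-warp divergence on this dispatch.
    if (warpId == 0 && laneId == 0) results[0] = functionA(data);
    if (warpId == 1 && laneId == 0) results[1] = functionB(data);
    // ...
}
```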

functionA, functionB, and functionC can run concurrently; their results will be reduced later.

Now, I know the GPU is not ideal for such tasks, but what I describe is just one part of the kernel. So how do I make it as efficient as possible? I have some thoughts:

  1. Is it better to select the thread with a set of independent if statements or with an else-if chain?
  2. I also had an idea, though I am not sure it is possible in CUDA: keep a list of lambda functions and index into it using the thread ID. In theory this would avoid branching, but I suppose it may not be a good idea.
  3. I also have an idea for tree-like nested ifs: first check whether the thread index is above or below 25, then whether it falls in a range like 1–10 or 10–20, and so on, and finally check for the exact thread. With such nesting, each thread would only ever see a number of ifs equal to the tree height.
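Regarding idea 2: CUDA does support tables of `__device__` function pointers, with the caveats that all entries must share one prototype and the addresses must be taken in device code. Indexing the table by `threadIdx.x` replaces the if-chain with an indirect call; note that if the call targets differ per lane, the warp still serializes, so this mainly tidies the code rather than removing divergence. A hedged sketch with placeholder functions:

```cuda
// All dispatched functions must share one prototype.
typedef float (*SmallFn)(const float*);

__device__ float functionA(const float* d) { return d[0] * 2.0f; }
__device__ float functionB(const float* d) { return d[1] + 1.0f; }
__device__ float functionC(const float* d) { return d[2] - 3.0f; }

// Function addresses are taken in device code, so the table is a
// __device__ array initialized with device-function pointers.
__device__ SmallFn fnTable[] = { functionA, functionB, functionC };

__global__ void tableDispatchKernel(const float* data, float* results)
{
    const int numFns = sizeof(fnTable) / sizeof(fnTable[0]);
    if (threadIdx.x < numFns)
        results[threadIdx.x] = fnTable[threadIdx.x](data);
}
```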

I am aware that in all cases the right approach is to profile and check, but I am creating a small utility library for personal use and I would like to keep it generic.
Any suggestions?

Thanks for help!

How many thread blocks do you use? Does thread 1 of every thread block execute functionA?


Something like 70 blocks. Yes, each thread block executes these functions; the function execution is based on data from the kernel, and its results are also used by the kernel down the line.

I mean, it would be rather counterproductive to push the required data to global memory, synchronize the grid, then execute functionA, B, C, … on all data points across blocks, synchronize the grid again, and so on, if that is what you are suggesting. But of course I will consider the idea if that is what you meant; maybe some refactoring could help.

Still, let's assume for a moment that those functions need to be executed in a single block, each exactly once.

The functions are also generally very simple and fast.

If it were me, I would seek to design the algorithm so that rather than having thread 1 in all 70 blocks run functionA, I would try to have 70 threads in block 1 run functionA.

I guess you’re saying that’s not possible.

I doubt you’re going to find much meaningful wisdom on a bunch of if statements vs. else-if. I’d be surprised if that sort of variation made much difference to the compiler, and even if it did you might find a changing preference as you move from one compiler version to the next.
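For completeness, a third spelling of the same dispatch is a switch on threadIdx.x (again with placeholder functions); the point above still applies, so only profiling will tell whether it differs from the if variants:

```cuda
// Placeholder device functions for the sketch.
__device__ float functionA(const float* d) { return d[0] * 2.0f; }
__device__ float functionB(const float* d) { return d[1] + 1.0f; }

__global__ void switchDispatchKernel(const float* data, float* results)
{
    // A dense switch may be lowered to a branch table by the compiler,
    // but any measured difference versus if/else-if chains is typically
    // small and can change between compiler versions.
    switch (threadIdx.x) {
        case 0: results[0] = functionA(data); break;
        case 1: results[1] = functionB(data); break;
        // ... up to ~50 cases
        default: break;
    }
}
```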

I think it's a good idea if you can "standardize" the behavior to a single function prototype, but I doubt that is going to eliminate the issue here if the code is indeed doing something quite different thread-to-thread. I doubt there is any lambda magic that would actually work around this, if the lambdas are different.


If these are functions of a computational nature, would it be possible to unify the computation such that all threads use a common computational framework, with the only difference between threads being different data? For example, all functions might be representable as a (sequence of) matrix multiplies, or all functions might be representable as polynomials.
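One way to picture that unification: if every function could be expressed as, say, a polynomial of some fixed degree, the per-thread difference collapses into per-thread coefficients, and all lanes execute identical instructions. A hedged sketch (the degree and coefficient layout are purely illustrative):

```cuda
#define DEGREE 4  // illustrative fixed polynomial degree

// Every thread runs the same Horner loop; only the coefficient row it
// reads differs, so there is no control-flow divergence on this path.
__global__ void polyKernel(const float* __restrict__ coeffs, // [numFns][DEGREE+1]
                           const float* __restrict__ x,      // per-function input
                           float* __restrict__ results,
                           int numFns)
{
    const int fn = threadIdx.x;
    if (fn >= numFns) return;

    const float* c = coeffs + fn * (DEGREE + 1);
    float acc = c[0];
    for (int k = 1; k <= DEGREE; ++k)
        acc = acc * x[fn] + c[k];  // Horner's scheme, uniform across lanes
    results[fn] = acc;
}
```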

This is an interesting idea. Maybe some of the functions would be amenable to a transformation into linear algebra; with some I may be successful, with others not. Still, it is a very good idea, thanks!