This is part of the kernel code I am trying to write:
sharedmemlocation = register + register;   // case 1
sharedmemlocation = register + global;     // case 2
sharedmemlocation = global + register;     // case 3
sharedmemlocation = global + global;       // case 4
Given that different threads inside a warp take different cases, will I suffer two global memory read latencies for the warp as a whole?
That is, will all the threads wait until the global memory loads come back, or will some threads in the warp execute case 1 while all the other threads execute nops,
and then, when the data comes back, case 4 runs for some threads with nops for the rest?
There are two possibilities for branches AFAIK. Either the divergent paths within a warp get serialized, which is bad, or predication is used.
In the latter case, usually both branches are executed and only the result of the true one is kept. Sometimes the instructions for both cases are issued and only the threads for which the condition is true actually execute them.
As a lower limit, perhaps: take the performance you would get if you did the computations for all branches in every case and used only the result of the true one. If that performance is enough for you, it is worth a shot.
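To make the "do everything and keep only the true result" idea concrete, here is a minimal sketch of how it could look. The kernel name addAllCases, the arrays g0, g1 and caseId, and the register values r0 and r1 are all made up for illustration and are not taken from your code. It assumes both global addresses are valid for every thread, so the two loads can be issued unconditionally.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread issues both global loads unconditionally
// and then picks its operands by case, so there is no divergent branch
// around the loads. The price is that every thread pays for two global reads.
__global__ void addAllCases(const float* g0, const float* g1,
                            const int* caseId, float* out, int n)
{
    extern __shared__ float smem[];      // stands in for sharedmemlocation
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float r0 = 1.0f, r1 = 2.0f;          // placeholder "register" operands
    float a = g0[i];                     // global load, always issued
    float b = g1[i];                     // global load, always issued
    int c = caseId[i];                   // 1..4, numbered as in the question

    // Cases 3 and 4 take the left operand from global memory,
    // cases 2 and 4 take the right operand from global memory.
    float lhs = (c == 3 || c == 4) ? a : r0;
    float rhs = (c == 2 || c == 4) ? b : r1;

    smem[threadIdx.x] = lhs + rhs;
    out[i] = smem[threadIdx.x];          // copied out only so the host can inspect it
}

int main()
{
    const int n = 64;
    float hG0[n], hG1[n], hOut[n];
    int hCase[n];
    for (int i = 0; i < n; ++i) {
        hG0[i] = 10.0f;
        hG1[i] = 20.0f;
        hCase[i] = i % 4 + 1;            // mix all four cases inside each warp
    }

    float *dG0, *dG1, *dOut;
    int *dCase;
    cudaMalloc(&dG0, n * sizeof(float));
    cudaMalloc(&dG1, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dCase, n * sizeof(int));
    cudaMemcpy(dG0, hG0, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dG1, hG1, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dCase, hCase, n * sizeof(int), cudaMemcpyHostToDevice);

    // One block of n threads, with n floats of dynamic shared memory.
    addAllCases<<<1, n, n * sizeof(float)>>>(dG0, dG1, dCase, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; ++i)
        printf("case %d -> %.1f\n", hCase[i], hOut[i]);

    cudaFree(dG0); cudaFree(dG1); cudaFree(dOut); cudaFree(dCase);
    return 0;
}
```

Timing something like this against your branching version should tell you whether the always-load variant is fast enough. Whether the compiler really turns the two selections into predicated instructions depends on the toolchain and target, so it is worth checking the generated code.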
So, to get you started: your pseudocode gives the impression that you have only a few computations, so my guess is that memory bandwidth will be the limit (as always). All your branches together need 4 global loads. That is a balance of about 4 bytes (one word) per flop, which means you get 86.4 GB/s / 4 bytes per flop, or roughly 20 GFLOP/s.
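Spelled out, the estimate works as follows. This is my reading of the numbers, assuming 4-byte words, one addition per case, and the 86.4 GB/s bandwidth figure used above.

```latex
\[
\underbrace{0 + 1 + 1 + 2}_{\text{global loads in cases 1--4}} = 4\ \text{words} = 16\ \text{bytes for 4 additions}
\;\Rightarrow\; 4\ \text{bytes per flop}
\]
\[
\frac{86.4\ \text{GB/s}}{4\ \text{bytes/flop}} = 21.6\ \text{GFLOP/s} \approx 20\ \text{GFLOP/s}
\]
```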