I assume the compiler will be able to optimize some of your other expressions as well, but the most important optimization is to have only one global memory transaction.
The speed will depend on the amount of warp divergence.
The third option you posted is fastest in the presence of divergent paths within warps, as it will trigger only a single memory transaction (per half-warp).
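To illustrate the idea, here is a minimal sketch. It assumes the pattern under discussion is "pick one of two array elements depending on a per-thread condition"; the original code is not quoted here, so the kernel name `select_then_load`, the helper `run_select`, and the odd/even condition are all made up for illustration. The point is only that the branch selects an index and the global load is issued once, outside the divergent part. Whether that load ends up as a single coalesced transaction still depends on the addresses the threads generate.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the per-thread condition only chooses an index,
// so the kernel body contains one global load instead of one per branch.
__global__ void select_then_load(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Assumed condition for the sketch: odd threads read their right neighbour.
    bool take_next = (tid & 1) != 0;

    // Divergent alternative, shown for contrast only:
    //   if (take_next) v = in[(tid + 1) % n]; else v = in[tid];
    // Here each side of the branch carries its own load.

    int idx = take_next ? (tid + 1) % n : tid;  // index selection, no memory access
    out[tid] = in[idx];                         // the only global load
}

// Hypothetical host helper: copies the input to the device, runs the kernel,
// and returns the result. Error checking is omitted to keep the sketch short.
std::vector<float> run_select(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    select_then_load<<<blocks, threads>>>(d_in, d_out, n);

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<float> out = run_select({0.0f, 1.0f, 2.0f, 3.0f});
    for (float v : out) printf("%g ", v);
    printf("\n");
    return 0;
}
```

If you compile this with nvcc and dump the machine code with `cuobjdump -sass`, you would expect to see a single global load in the kernel body, but check the disassembly for your own architecture rather than taking that on trust.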
As a lot of optimization happens past the PTX stage, it's sometimes more helpful to disassemble the actual binaries using decuda, nv50dis/nvc0dis, or cuobjdump.