I heard somewhere that a GPU needs only one instruction to perform an operation on all elements of a short vector (say, float3). However, when I use arithmetic operators directly on short vectors, my CUDA code won’t compile. In my code, I have a lot of operations like
With something like float3, you will generally have a lot of them, perhaps thousands. Instead of one operation per float3, there will generally be one operation applied to one component of many float3s, like this:
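A minimal sketch of that idea, not the original poster's code. It assumes one thread per float3; the names `add_float3` and `add_on_gpu` and the `operator+` overload are hypothetical helpers written for illustration. CUDA's built-in float3 type comes without arithmetic operators, which is why using them directly fails to compile. Each thread performs scalar additions on its own float3, and the hardware runs the same scalar instruction for many threads at once, so one instruction effectively handles one component of many float3s.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: float3 has no built-in operator+, so define one
// component by component.
__host__ __device__ inline float3 operator+(float3 a, float3 b)
{
    return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
}

// One thread per float3. Within a thread the three additions are separate
// scalar operations; across the threads of a warp, each of those operations
// is issued once for many float3s.
__global__ void add_float3(const float3* a, const float3* b, float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Host wrapper (an assumption for this sketch): copies the inputs to the
// device, launches the kernel, and copies the result back.
// Returns true if every CUDA call succeeded.
bool add_on_gpu(const std::vector<float3>& a,
                const std::vector<float3>& b,
                std::vector<float3>& out)
{
    if (a.size() != b.size()) return false;
    const int n = static_cast<int>(a.size());
    out.resize(a.size());
    if (n == 0) return true;

    const size_t bytes = a.size() * sizeof(float3);
    float3 *da = nullptr, *db = nullptr, *dout = nullptr;

    bool ok = cudaMalloc(&da, bytes) == cudaSuccess
           && cudaMalloc(&db, bytes) == cudaSuccess
           && cudaMalloc(&dout, bytes) == cudaSuccess
           && cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;  // round up
        add_float3<<<blocks, threads>>>(da, db, dout, n);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree accepts a null pointer, so this is safe after a partial failure.
    cudaFree(da);
    cudaFree(db);
    cudaFree(dout);
    return ok;
}
```

Calling `add_on_gpu(a, b, out)` with two equally sized vectors of float3 fills `out` with the component-wise sums. Whether a separate array per component would perform better than an array of float3 depends on the memory access pattern, so treat the layout here as one option rather than a recommendation.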