Operations on short vectors: is there a way to reduce them?

I heard somewhere that a GPU needs only one operation to process all elements of a short vector (say, a float3). However, when I use arithmetic operators directly on short vectors, CUDA won't compile. In my code I have a lot of operations like

float3 pos, dir;

  float alpha;

  ...

  pos = make_float3(pos.x + alpha*dir.x, pos.y + alpha*dir.y, pos.z + alpha*dir.z);

  ...

I am wondering if there is a way to write like

pos = pos + alpha * dir;

so that it only costs two operation units? (I remember it is 4 cycles for one float op, right?)

How many cycles are needed for the fully expanded version?


If each element is handled by a separate thread, then yes, the update can happen in what is effectively a single operation per thread.

For example:

float alpha, pos[32], dir[32];

pos[threadIdx.x] = pos[threadIdx.x] + alpha * dir[threadIdx.x];  // one multiply-add per thread updates all 32 values

For something like float3, you will generally have a lot of them, perhaps thousands. Instead of using one operation for each float3, there will generally be one operation for one component of many float3s, like this:

float3 pos[32], dir[32];

float alpha;

pos[threadIdx.x].x = pos[threadIdx.x].x + alpha * dir[threadIdx.x].x;  // single multiply-add operation

pos[threadIdx.x].y = pos[threadIdx.x].y + alpha * dir[threadIdx.x].y;  // single multiply-add operation

pos[threadIdx.x].z = pos[threadIdx.x].z + alpha * dir[threadIdx.x].z;  // single multiply-add operation

Here with only three operations, all three components of 32 float3 vectors have been updated.