Yes, you have to define your own operator or use elementwise addition, as hyqneuron wrote. The compiler is still good enough to optimize the necessary memory accesses into 256-bit loads and stores.
Elementwise addition works fine, of course; I rewrote that kernel to use it. But honestly it looks odd, especially when you recall that this isn't a CUDA 1.0 alpha but the CUDA 3.2 release…
To have a vector type but no basic operations on it… weird!
Moreover, I think I've seen addition for the float4 type somewhere in the CUDA samples (perhaps it pulls the operators in from cutil_math.h, if I remember right). Interesting, will that sample compile?..
I thought something was wrong with my CUDA setup, but if it's really a lack of built-in vector addition…