Packed SSE-like math functions for float4/int4 etc.


I couldn't find any packed math functions for vector data types in the specs,
like SSE on the CPU.
This would be a great improvement, especially for 3D calculations and projective
I know that the ALU design must have 128-bit registers. Is that possible with


See the FAQ. Current NVIDIA GPUs have a scalar architecture, so there is no arithmetic advantage to packed vector types.
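
To make that concrete, here is a minimal sketch of what a "packed" add looks like on a scalar architecture. The names `add4` and `add4_kernel` are made up for this example; CUDA's built-in `float4` does not come with an `operator+`, so the helper just spells out four scalar adds, and each thread handles one `float4`.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): component-wise add of two float4
// values. On a scalar architecture this is simply four independent adds.
__host__ __device__ inline float4 add4(float4 a, float4 b)
{
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// One thread per float4 element. The parallelism comes from running many
// threads side by side, not from packing four floats into one instruction.
__global__ void add4_kernel(const float4* a, const float4* b, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = add4(a[i], b[i]);
    }
}
```

`float4` can still be convenient as a storage type, since it is 16-byte aligned and can be read or written as one 128-bit memory access, but the arithmetic itself is done per component.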


Are there plans for this in the future?

You can think of the multiprocessors as 32-wide vector units. This (excellent) paper may help explain things:

GPUs made a wonderful innovation just a couple years back. They used to be SIMD for a long time, but someone very clever realized that you can turn the concept on its head and create SIMT. In SIMT you still have vector units, but each element of the vector is emulated to be an independent thread. As long as all threads inside the vector operate in lock-step, performance is just as great as with SIMD, but with far less coding effort. It’s a much more elegant solution than an autovectorizing compiler. If the threads in a vector wish to do different things, they may, with only a partial performance degradation (a much smaller degradation than if a compiler had to forgo vectorization entirely).

In CUDA, these concepts take the names of “warp”, “divergence”, “coalescing”, etc.
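
A small sketch of how this looks in code, with illustrative names (`abs_or_double`, `branchy_kernel`, `run_branchy`) that are not from any library. Each thread is written as ordinary scalar code. Neighbouring threads read neighbouring elements, which is the access pattern that coalesces well. When threads in the same warp take different sides of the `if`, the warp runs both sides one after the other (divergence). For a branch this small the compiler may well use predicated instructions instead, so treat it purely as an illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Scalar per-thread work with a data-dependent branch.
__host__ __device__ inline float abs_or_double(float x)
{
    if (x < 0.0f) {
        return -x;        // path taken by threads holding negative values
    }
    return 2.0f * x;      // path taken by all other threads
}

// Thread i handles element i, so the 32 threads of a warp touch
// 32 consecutive floats.
__global__ void branchy_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = abs_or_double(in[i]);
    }
}

// Host-side launcher. Error checking is omitted to keep the sketch short.
std::vector<float> run_branchy(const std::vector<float>& host_in)
{
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) {
        return host_out;
    }

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;                          // four warps per block
    int blocks = (n + threads - 1) / threads;   // round up to cover all n
    branchy_kernel<<<blocks, threads>>>(d_in, d_out, n);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(host_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

If the input is sorted so that each warp sees only negative or only non-negative values, every warp follows a single path and nothing is serialized.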

The additional hardware to turn SIMD into SIMT is not much, given the overwhelming benefits, and there’s not much reason to go back. The only downside is having to launch more logical threads.