Two instructions should be particularly interesting for NVIDIA driver developers:
- MOVNTDQA – streaming load from USWC memory
This instruction should make reads from memory-mapped I/O devices that use the USWC memory type between 5.03x and 7.72x faster.
- DPPS – Dot Product instruction
Sample SSE4.1 code using DPPS:
    movaps xmm0, xmmword ptr [vec1]
    dpps   xmm0, xmmword ptr [vec2], 0xF1
    movss  dword ptr [result], xmm0
It replaces the following SSE3 code (HADDPS was introduced with SSE3, not SSSE3):
    movaps xmm0, xmmword ptr [vec1]
    movaps xmm1, xmmword ptr [vec2]
    mulps  xmm0, xmm1
    haddps xmm0, xmm0
    haddps xmm0, xmm0
    movss  dword ptr [result], xmm0
And it does it in 3 clocks instead of 5.
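For anyone wondering what the 0xF1 immediate in the DPPS sample means, here is a rough scalar model in C. It is only a sketch of the instruction's semantics as I understand them from Intel's documentation, not driver code and not a performance test; the function name dpps_model is made up for illustration.

```c
#include <stdint.h>

/* Scalar model of DPPS dst, src, imm8 (hypothetical helper, for illustration).
 *   imm8 bits 7..4: select which element products take part in the sum.
 *   imm8 bits 3..0: select which destination elements receive the sum;
 *                   unselected destination elements are set to 0.
 * dst may alias a, as in the instruction, because the products are
 * computed before dst is written. */
static void dpps_model(float dst[4], const float a[4], const float b[4],
                       uint8_t imm8)
{
    float p[4];

    /* Conditional multiply: masked-out products contribute 0. */
    for (int i = 0; i < 4; ++i)
        p[i] = (imm8 & (0x10u << i)) ? a[i] * b[i] : 0.0f;

    /* Pairwise summation: low pair plus high pair. */
    float sum = (p[0] + p[1]) + (p[2] + p[3]);

    /* Conditional broadcast of the sum into the destination. */
    for (int i = 0; i < 4; ++i)
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

So with 0xF1, the high nibble F multiplies and sums all four elements, and the low nibble 1 writes the result to element 0 only, which is why a single MOVSS is enough to store it. Something like 0x71 would give a 3-component dot product instead.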
My question is: when will NVIDIA driver developers start using these new instructions so that we can take advantage of our new hardware?