GPGPU readback, MOVNTDQA, DPPS, drivers: when will this get implemented?

Dear NVIDIANS,

As you may have noticed, on January 20th new Intel CPUs (E8xxx family) supporting the SSE4.1 instruction set extensions (PDF) became available at retail.

Two instructions should be particularly interesting for NVIDIA driver developers:

  1. MOVNTDQA – streaming load from USWC memory

This instruction should enable between 5.03x and 7.72x faster reads from memory-mapped I/O devices using the USWC memory type.
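For illustration, here is a minimal sketch of the kind of copy loop MOVNTDQA enables, written with the _mm_stream_load_si128 intrinsic. It assumes a 16-byte-aligned source buffer that is actually mapped with the USWC memory type (for example a device aperture); the function and buffer names are mine, not from any driver:

#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */
#include <stddef.h>

/* Copy from a USWC-mapped source into ordinary cacheable memory.
 * "bytes" is assumed to be a multiple of 64 and both pointers
 * 16-byte aligned. */
void wc_to_cacheable_copy(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i += 4) {
        /* MOVNTDQA: the loads go through the streaming-load buffer
         * instead of issuing uncached partial reads. */
        __m128i r0 = _mm_stream_load_si128((__m128i *)&s[i + 0]);
        __m128i r1 = _mm_stream_load_si128((__m128i *)&s[i + 1]);
        __m128i r2 = _mm_stream_load_si128((__m128i *)&s[i + 2]);
        __m128i r3 = _mm_stream_load_si128((__m128i *)&s[i + 3]);
        _mm_store_si128(&d[i + 0], r0);
        _mm_store_si128(&d[i + 1], r1);
        _mm_store_si128(&d[i + 2], r2);
        _mm_store_si128(&d[i + 3], r3);
    }
}

On ordinary write-back memory MOVNTDQA behaves like a plain 16-byte load, so the gain only shows up on USWC mappings.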

  2. DPPS – Dot Product instruction

Sample SSE4.1 code using DPPS:

    movaps  xmm0, xmmword ptr [vec1]
    dpps    xmm0, xmmword ptr [vec2], 0xF1
    movss   dword ptr [result], xmm0

It replaces the following SSE3 code:

    movaps  xmm0, xmmword ptr [vec1]
    movaps  xmm1, xmmword ptr [vec2]
    mulps   xmm0, xmm1
    haddps  xmm0, xmm0
    haddps  xmm0, xmm0
    movss   dword ptr [result], xmm0

And it does it in 3 clocks instead of 5.
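For completeness, the same comparison can be written with compiler intrinsics instead of assembly. This is only a sketch (the function names are mine), and the DPPS path needs a compiler that accepts -msse4.1 or the equivalent:

#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */
#include <smmintrin.h>   /* SSE4.1: _mm_dp_ps */

/* SSE4.1 path: one DPPS does the multiplies and the horizontal sum.
 * Mask 0xF1 = use all four lanes, write the result to lane 0 only. */
float dot4_sse41(const float *vec1, const float *vec2)
{
    __m128 a = _mm_loadu_ps(vec1);
    __m128 b = _mm_loadu_ps(vec2);
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}

/* Pre-SSE4.1 path: MULPS followed by two HADDPS. */
float dot4_sse3(const float *vec1, const float *vec2)
{
    __m128 p = _mm_mul_ps(_mm_loadu_ps(vec1), _mm_loadu_ps(vec2));
    p = _mm_hadd_ps(p, p);
    p = _mm_hadd_ps(p, p);
    return _mm_cvtss_f32(p);
}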

My question is when will NVIDIA driver developers implement those new instructions to allow us to take advantage of our new hardware?

Thanks. We are aware of the new SSE instructions and are continually evaluating new CPU functionality to improve our drivers.

Thanks for replying, Simon. Does your answer mean the new instructions will be used, and if so, can you give us any hints as to when we could expect that?

If I understand correctly, MOVNTDQA could improve GPGPU readback speed considerably and give you a serious advantage in GPGPU over your competitors.

I’m not sure; as far as I have seen, NVidia uses DMA for all transfers between the card and the host. Using this instruction would make transfers synchronous and I don’t think it’d be faster than DMA.

Well, if 800 MB/sec for OpenGL readback and 2,500 MB/sec in CUDA (with my 8800 GTX) is the best they can do with DMA (assuming that is DMA at all), then yes, this would be faster, because that would be very close to the theoretical bandwidth limit of the Front Side Bus – at a 1066 MHz FSB that is close to 8.5 GB/sec.

It would be nice if someone from NVIDIA could chime in and say “possible / not possible”.

And what about the PCI-Express bandwidth limitation of 4 GB/s in each direction?

And whatever bandwidth is needed to write things back to memory. Presuming this fancy new instruction could max out PCIe2 bandwidth of 8GB/s, that data is still coming over the FSB into the proc and then going out to memory over the FSB, so the FSB is going to limit it. DMA is better because it doesn’t involve the CPU. And with the right mobo, you can get 3.2 GB/s transfer to/from the card which is as close to the theoretical peak as you can get due to some overhead in the PCIe transfers (info from a post by Mark Harris that I’m too lazy to link to).
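For reference, the pinned-memory figures quoted in this thread come from the SDK's bandwidthTest sample. A stripped-down sketch of that kind of measurement looks roughly like this (the 32 MiB buffer and iteration count are arbitrary choices of mine, not the sample's exact settings):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 32 * 1024 * 1024;   /* 32 MiB, as in the results below */
    const int    iters = 10;
    void *h_buf, *d_buf;
    cudaEvent_t start, stop;
    float ms;

    cudaMallocHost(&h_buf, bytes);           /* pinned (page-locked) host memory */
    cudaMalloc(&d_buf, bytes);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)          /* device -> host readback via DMA */
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);

    printf("Device to Host (pinned): %.1f MB/s\n",
           bytes / (1024.0 * 1024.0) * iters / (ms / 1000.0));

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}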

Whatever, you can’t get 3.2 GB/s readback with any mainboard – perhaps if you are Mark Harris, but unfortunately I am not him.

And are you sure that using the new instructions will raise actual bandwidth? I really doubt it, since the CPU is not the bottleneck here.

Huh, well, I get 3.1 GiB/s all the time on my system (an ancient 3-year-old Dell). tachyon_john is using an NVIDIA mainboard with LinkBoost that overclocks the PCIe lanes and achieves over 4 GiB/s IIRC.

Edit - Link: http://forums.nvidia.com/index.php?showtop…39&hl=linkboost (I guess the LinkBoost setup only got 3900 MB/s). Anyway, just search the forums; a lot of people report ~3.1 GiB/s, though most systems do seem to average around 2.5 GiB/s.

I’m still waiting to see someone post benchmarks with the latest PCIe v2 mobos and matching G92 cards … anyone?

The new instructions might improve performance of pageable host memory.

They will not improve performance of host memory allocated with cuMemAllocHost. But, apps that use cuMemAllocHost might benefit from doing streaming loads from the buffer written by device->host memcpy.
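One way to read that suggestion, as a sketch only: do the device->host copy into a pinned buffer as usual, then walk the buffer with SSE4.1 streaming loads. Whether the streaming loads actually help depends on how the driver maps the pinned allocation; the function and names below are mine and only show the combination:

#include <cuda_runtime.h>
#include <smmintrin.h>   /* SSE4.1 streaming loads */
#include <stddef.h>

/* Sum a device array on the host after a pinned readback.
 * d_data is a device pointer; n_floats is assumed to be a multiple of 4. */
float sum_readback(const float *d_data, size_t n_floats)
{
    float *h_buf;
    cudaMallocHost((void **)&h_buf, n_floats * sizeof(float));   /* pinned buffer */
    cudaMemcpy(h_buf, d_data, n_floats * sizeof(float),
               cudaMemcpyDeviceToHost);                          /* DMA readback */

    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n_floats; i += 4) {
        /* MOVNTDQA load of 16 bytes from the readback buffer */
        __m128i v = _mm_stream_load_si128((__m128i *)(h_buf + i));
        acc = _mm_add_ps(acc, _mm_castsi128_ps(v));
    }
    acc = _mm_hadd_ps(acc, acc);    /* horizontal sum of the accumulator */
    acc = _mm_hadd_ps(acc, acc);

    float result = _mm_cvtss_f32(acc);
    cudaFreeHost(h_buf);
    return result;
}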

I can’t be 100% sure; that is why I am asking the NVIDIA driver developers.

The instruction improves bandwidth of reading from USWC memory:

http://softwarecommunity.intel.com/articles/eng/1248.htm

How that may be used by NVIDIA I don’t know, but the gains are certainly worthwhile.

As for others boasting 3.2 GB/sec, that is host-to-device transfer. What I am more interested in is device-to-host transfer, especially in OpenGL, which seems to be a lot slower for some reason. That is also where I believe this instruction might prove useful, but maybe I am wrong – again, that is why I asked in the first place.

nForce 780i chipset (Asus P5N-T Deluxe) with an 8800GT, no tweaking whatsoever:

Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                3188.3

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                3193.5

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                48265.5

Since the effective memory bandwidth to the PC2-6400 RAM used is about 3750 MB/s (according to memtest86+), I guess it can be a limiting factor.
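As a quick sanity check of that ceiling, you can time large memcpy() passes through host RAM. This is only a rough sketch (buffer size and iteration count are arbitrary, and memtest86+ measures bandwidth differently, so the numbers will not match exactly):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;
    const int    iters = 20;
    char *a = malloc(bytes), *b = malloc(bytes);

    memset(a, 1, bytes);                     /* touch the pages before timing */
    memset(b, 2, bytes);

    clock_t t0 = clock();
    for (int i = 0; i < iters; ++i)
        memcpy(b, a, bytes);
    double s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* each memcpy reads and writes "bytes", so count the traffic twice */
    printf("host copy bandwidth: %.0f MB/s\n",
           2.0 * bytes * iters / (1024.0 * 1024.0) / s);

    free(a);
    free(b);
    return 0;
}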

These are some measurements (MB/s) on a Gen2 motherboard with a Gen2 card (8800GT) for 3 different systems:

Cool. Thanks for the benchmarks guys. It seems that PCIe gen 2 is fast enough that the RAM speed starts to become a limiting factor.

Any ideas how an Intel X38-based DDR2 system (the Sun Ultra 24) can be so much faster than an NVIDIA nForce 780i-based DDR2 system? Don’t the GeForce and nForce like each other? ;-)

  1. Why do 2 of the results show 3.7x slower readback?

  2. How about OpenGL readback?

  3. Will that PCI-E 2.0 bandwidth be usable for other applications, or only for bandwidth benchmarks?

I have essentially the same questions. With a 780i, 8800GT and SLI-enabled memory (read PCIe 2.0 and G92 core), I see speeds similar to a 680i / 8800 GTX for host->device and device->host per the bandwidth test, just a 10% improvement. Should I be looking at the Intel chipset for CUDA applications?

Is there some tuning / programming tool available for the 780i?


Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                3964.8

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                4014.1

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                48218.2

&&&& Test PASSED

Press ENTER to exit…


Replacing my NVIDIA 780i motherboard (Asus P5N-T Deluxe) with an Intel X38 (Asus P5E WS PRO), the host<->device bandwidth increased from ~3200 to ~4600 MB/s pinned (both directions), and the memory bandwidth measured by memtest86+ increased from 3749 to 4479 MB/s. All other hardware was the same, running default BIOS settings and the most recent BIOS in both cases.

– Kuisma