Double precision on mobile GPUs

Hi All,

I was not able to find this by searching, so I’ll ask here:

Do the Quadro 2000M/3000M/4000M support double precision (the slow GeForce way)? Or is the 5010M the only one with double precision (the fast Tesla/Quadro way)?

If anybody has more information, that would be great, since I would like to have double precision support for demos.

Thanks in advance,
Denis

Hello,

Go to http://developer.nvidia.com/cuda-gpus and check the compute capability. If it is below 1.3, there is no double precision support. For these specific cards it is:

Quadro 5000M 2.0
Quadro 4000M 2.0
Quadro 3000M 2.0
Quadro 2000M 2.0

So they should all have double precision. When you compile code for the GPU, add the flag -arch=sm_20 to target compute capability 2.0; otherwise nvcc will compile for 1.0 by default and there will be no double precision (doubles are demoted to floats).
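A minimal sketch of checking this programmatically, assuming a CUDA toolkit with the runtime API: it lists each device's compute capability via cudaGetDeviceProperties and applies the 1.3 cutoff mentioned above. The file name dp_check.cu and the helper supports_double are made up for the example. Recent toolkits no longer accept -arch=sm_20 (Fermi support was dropped), so substitute the architecture of your GPU there.

```cuda
// dp_check.cu (hypothetical file name)
// Build: nvcc -arch=sm_20 dp_check.cu -o dp_check
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: native double precision needs compute capability
// 1.3 or higher, so every Fermi (2.x) GPU qualifies.
bool supports_double(int major, int minor) {
    return major > 1 || (major == 1 && minor >= 3);
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        std::printf("Device %d: %s, compute capability %d.%d, double precision: %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    supports_double(prop.major, prop.minor) ? "yes" : "no");
    }
    return 0;
}
```

This only reports whether double precision exists in hardware, not how fast it runs; the 1/2 versus 1/8 question needs a measurement.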

I think his question is whether the double precision throughput is capped at 1/8 of the single precision throughput, as it is on the GeForce cards. I seem to recall a statement that the desktop Quadros run double precision at the Tesla rate, 1/2 that of single precision, but I have no idea what the mobile GPUs are set to.

I see, my bad. The 2000M/3000M/4000M/5000M cards use GF104 and GF106, the same chips as the 400 series, which have half the performance in double precision compared to single precision. The Quadro 5010M uses the same chip as the GTX 580, which had equal performance in double and single precision.

Cristian

Here is the Wikipedia page with all the cards: http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

Actually, I do not really require full-blown double precision speed. Just having the support would be nice.

I got confused by these pages:
http://www.notebookcheck.net/NVIDIA-Quadro-5010M.47195.0.html
and
http://www.notebookcheck.net/NVIDIA-Quadro-4000M.47295.0.html

where for the 5010M it is stated: "The Quadro 5010M is the successor of the Quadro 5000M and also offers ECC RAM and double-precision floating point cores"
and for the 4000M: "Compared to the 5010M, the 4000M does not support ECC memory and DP floating point calculations"

So I have some doubts, given that notebookcheck usually has quite good information. Does anybody have experience with double precision on notebook GPUs?

As far as I understand it, if the GPU is compute capability 2.0 or 2.1, then it has to support double precision. There is no equivalent in Fermi to the capability 1.2 mobile chips of the previous generation that were just like capability 1.3, but without the double precision.

Unfortunately, it sounds like the laptop vendor is confused. If they are correct about GF104, then you should have double precision at 1/12 of the maximum single precision throughput.

Quadro 5000M 2.0
Quadro 4000M 2.0
Quadro 3000M 2.0
Quadro 2000M 2.0
I am starting to get confused. These cards are based on the same chips as the 400M series, which support double precision at half the speed of single precision. The same goes for the 5010M:
Quadro 1000M GF108
Quadro 2000M GF106
Quadro 3000M GF104
Quadro 4000M GF104
Quadro 5000M GF100
Quadro 5010M GF110GLM

GeForce GT 435M GF108
GeForce GTX 460M GF106
GeForce GTX 470M GF104
GeForce GTX 480M GF100
GeForce GTX 485M GF104
M2090 GF110

It is strange. According to this NVIDIA page, only the 5010M supports double precision: http://www.nvidia.com/object/quadro-mobile-features-benefits.html, but according to this page, http://www.nvidia.com/object/product-quadro-5000m-us.html, the 5000M supports double precision and ECC as well.

Well, that could be because the 5010M is the successor of the 5000M, so the 5010M may be the only mobile card they currently manufacture that supports DP.

Interesting to see that it is based on the same chip as the M2090.

Anyhow, I have ordered the Quadro 4000M, as the 5010M is quite a lot more expensive. I’ll give it a try when it gets here :)

Yes, so I guess the 2000M/3000M/4000M cards are 2.0 but do not support double precision, while the 5000M and 5010M do. Their website is not very clear on this.

All Fermi cards support double precision, at no less than 1/8 of single precision speed. I have the 2000M and it definitely supports double precision (at 1/8 of single precision performance). I don’t know about the other mobile GPUs, but the Quadro 4000 and up and the Teslas run double precision at 1/2 of single precision speed.

It is stated explicitly on the NVIDIA webpage that only the 5000M and 5010M have double precision, while the other three mentioned do not.

They are talking about double precision at 1/2 of single precision speed, not about double precision support in general. I know for a fact from first-hand experience that the 2000M, 1000M, GTX 430, GTX 480 and GTX 570 support it, and from second-hand experience that every Fermi GPU does, possibly excluding the weird 16-core laptop Fermis, which I don’t know about. I did read somewhere that the performance is 1/12 rather than 1/8 of single precision, but it’s definitely there, regardless of how NVIDIA words things on the website.
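Whether double precision is present, and how fast it runs, can be checked directly in the spirit of what CUDA-z reports. Below is a rough sketch (not CUDA-z's actual method) that times a kernel doing only multiply-adds, first in float and then in double, and prints the ratio. The kernel shape, launch sizes and names are assumptions; a tuned benchmark would use more independent chains and repeated runs, so treat the numbers as approximate.

```cuda
// fma_bench.cu: rough single vs double multiply-add throughput (sketch)
// Build: nvcc -O2 -arch=sm_20 fma_bench.cu -o fma_bench
#include <cstdio>
#include <cuda_runtime.h>

// Flops done by fma_loop: 4 chains x iters multiply-adds per thread,
// counted as 2 flops each.
double fma_flops(double blocks, double threads, double iters) {
    return 2.0 * 4.0 * blocks * threads * iters;
}

// Four independent multiply-add chains per thread; nvcc normally
// contracts x * b + c into a single fused multiply-add instruction.
template <typename T>
__global__ void fma_loop(T *out, int iters) {
    T b = (T)0.999, c = (T)1e-4;
    T x0 = (T)threadIdx.x, x1 = x0 + 1, x2 = x0 + 2, x3 = x0 + 3;
    for (int i = 0; i < iters; ++i) {
        x0 = x0 * b + c;
        x1 = x1 * b + c;
        x2 = x2 * b + c;
        x3 = x3 * b + c;
    }
    // Store the result so the compiler cannot drop the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}

template <typename T>
double measure_gflops(int blocks, int threads, int iters) {
    T *out = 0;
    cudaMalloc((void **)&out, sizeof(T) * blocks * threads);
    fma_loop<T><<<blocks, threads>>>(out, iters);  // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    fma_loop<T><<<blocks, threads>>>(out, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    // flops / (ms * 1e-3 s) / 1e9 = GFLOPS
    return fma_flops(blocks, threads, iters) / (ms * 1.0e6);
}

int main() {
    const int blocks = 1024, threads = 256, iters = 4096;
    double sp = measure_gflops<float>(blocks, threads, iters);
    double dp = measure_gflops<double>(blocks, threads, iters);
    std::printf("single: %.1f GFLOPS\n", sp);
    std::printf("double: %.1f GFLOPS (1/%.1f of single)\n", dp, sp / dp);
    return 0;
}
```

A printed ratio near 2 would indicate the Tesla-class double precision rate, while a value near 8 or 12 corresponds to the capped rates mentioned above.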

By the way, running CUDA-z right now on the Quadro 2000M, which is a compute capability 2.1 GPU and which you claim has no double precision support, I see:

Single precision float: 279470 MFLOPS
Double precision float: 35087.3 MFLOPS (1/7.97)
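The 1/7.97 figure is simply the single precision rate divided by the double precision rate. A small host-only sketch of that arithmetic, which also shows what double precision rate the 1/2, 1/8 and 1/12 ratios mentioned in this thread would predict from the same single precision number (the helper name sp_to_dp_ratio is made up):

```cuda
// ratio.cu: host-only arithmetic; compiles with nvcc or any C++ compiler
#include <cstdio>

// How many times faster single precision is than double precision.
double sp_to_dp_ratio(double sp_mflops, double dp_mflops) {
    return sp_mflops / dp_mflops;
}

int main() {
    // CUDA-z figures for the Quadro 2000M quoted above, in MFLOPS
    const double sp = 279470.0;
    const double dp = 35087.3;
    std::printf("measured SP/DP ratio: %.3f\n", sp_to_dp_ratio(sp, dp));

    // Double precision rate each ratio from this thread would predict
    const int denominators[] = {2, 8, 12};
    for (int i = 0; i < 3; ++i) {
        std::printf("DP at 1/%d of SP: %.0f MFLOPS\n",
                    denominators[i], sp / denominators[i]);
    }
    return 0;
}
```

The measured ratio comes out at 7.965, and the 1/8 prediction (about 34934 MFLOPS) is within half a percent of the measured 35087 MFLOPS.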

I guess I misunderstood this statement from NVIDIA; I read it differently. Good to know.

Fast 64-Bit Floating Point Precision

Industry’s fastest double precision floating point performance enabling accurate results on mission-critical applications… Available only on Quadro 5010M.

Knowing NVIDIA and the hype around the HPC market, the person responsible for the page put the emphasis on “Fast 64-Bit” rather than “64-Bit”. The difference is in the speed, not the support.

Well, I will certainly run CUDA-z when I get the laptop (sometime in November…)
I was able to get the 5010M for a small extra price, so we should then see only a factor of 2 difference :)

Good to hear that double is supported on all Fermis, though; that makes it much more worthwhile to convert some code to double.

Well, GPU-z did not work very well: it shows the first page of info and then hangs.
The nbody example, however, showed 270 GFLOPS in single precision and 150 GFLOPS in double precision, which suggests that it is indeed roughly a factor of 2.

If anybody wants some benchmarks run on it (Windows only for the time being), let me know.

Hello,

I do not have a benchmark, but I have some code that you can test. It does a bunch of real-to-complex FFTs and some additions and multiplications. I have single and double precision versions. The attached codes should run in less than 10 minutes and print the timing at the end.

SPRTinplaceTwoDPFCtwodblock.cu is the single precision code
DPRTinplaceTwoDPFCtwodblock.cu is the double precision code.

For the double precision version, use the flag -arch=sm_20 to keep double precision. I am not running Windows, so I am not sure how to compile there, but from a command line the compile line would be:

nvcc -O2 -lcufft -arch=sm_20 DPRTinplaceTwoDPFCtwodblock.cu
or
nvcc -O2 -lcufft SPRTinplaceTwoDPFCtwodblock.cu
DPRTinplaceTwoDPFCtwodblock.cu (8.57 KB)
SPRTinplaceTwoDPFCtwodblock.cu (8.39 KB)
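For anyone adapting their own code rather than running the attachments: in cuFFT, moving from single to double precision mainly means switching type and function names (cufftReal, CUFFT_R2C, cufftExecR2C become cufftDoubleReal, CUFFT_D2Z, cufftExecD2Z). Below is a minimal out-of-place sketch of a batched double precision real-to-complex transform; it is not the attached code (which works in place), and the file name, sizes and helper complex_outputs are illustrative assumptions.

```cuda
// d2z_sketch.cu: batched double precision real-to-complex FFT (sketch)
// Build: nvcc -O2 -arch=sm_20 d2z_sketch.cu -lcufft -o d2z_sketch
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// A real-to-complex transform of length n yields n/2 + 1 complex values
// (the rest follow from Hermitian symmetry).
int complex_outputs(int n) { return n / 2 + 1; }

int main() {
    const int n = 4096;    // transform length (illustrative)
    const int batch = 64;  // number of transforms (illustrative)
    const int nc = complex_outputs(n);

    // Fill the input on the host with a simple ramp in each transform.
    cufftDoubleReal *h_in = new cufftDoubleReal[n * batch];
    for (int i = 0; i < n * batch; ++i) h_in[i] = (double)(i % n) / n;

    cufftDoubleReal *d_in = 0;
    cufftDoubleComplex *d_out = 0;
    cudaMalloc((void **)&d_in, sizeof(cufftDoubleReal) * n * batch);
    cudaMalloc((void **)&d_out, sizeof(cufftDoubleComplex) * nc * batch);
    cudaMemcpy(d_in, h_in, sizeof(cufftDoubleReal) * n * batch,
               cudaMemcpyHostToDevice);

    cufftHandle plan;
    if (cufftPlan1d(&plan, n, CUFFT_D2Z, batch) != CUFFT_SUCCESS) {
        std::printf("cufftPlan1d failed\n");
        return 1;
    }
    if (cufftExecD2Z(plan, d_in, d_out) != CUFFT_SUCCESS) {
        std::printf("cufftExecD2Z failed\n");
        return 1;
    }
    cudaDeviceSynchronize();

    // The (unnormalized) DC term of each transform is the sum of its
    // inputs, (n - 1) / 2 for the ramp above.
    cufftDoubleComplex first;
    cudaMemcpy(&first, d_out, sizeof(first), cudaMemcpyDeviceToHost);
    std::printf("DC term of first transform: %f\n", first.x);

    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h_in;
    return 0;
}
```

Checking the DC term against its known value is a cheap way to confirm that the double precision path actually ran.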

It might be after Wednesday, as the laptop is in a demo setup right now, but I’ll see if I can try these out this week. I’ll try building with CMake.