references for CUDA + double precision


Other than the CUDA manual, are there any publications (e.g. conferences/journals) that describe the CUDA architecture? I see some slides from a workshop at SC’06. Is there anything else?

Also, I have heard it said that CUDA will eventually support double precision. Is there a document that describes this? And will new hardware be necessary (beyond the 8800 GTX), or will the existing drivers be upgraded to support it?

We appreciate any information on this.


There is a class at UIUC with a lot of online material.

You will need new hardware for double precision.
It is going to be supported in the next generation of GPUs (towards the end of the year).

I am very interested in double precision on the GPU. I wonder: will it be full double precision, or something like the Cell, with 200+ GFLOPS single but ~15 GFLOPS double?



It appears that the next phase of CUDA hardware and software will support double precision only in Tesla. Here’s the second page of an eight-page interview:…irk-interview/2

That’s too disappointing.

There is a little bit of confusion.

This is what David Kirk said:

Tesla 2 is the name of the next generation architecture. He is not referring to the Tesla product line. Double precision will be on all the product lines (GeForce, Quadro, Tesla) using “Tesla 2” GPUs.

That is good news, because it had previously been stated that double precision would be Tesla-only (somewhere back last year).

Phew, starting to panic there! Thanks for the clarification.

This is good news.

Indeed, very good news: a lot of people could be convinced to use CUDA, especially in physics and astronomy, if double precision support were available.

This is so true. I gave a poster presentation at the American Physical Society for HOOMD and at least 1/2 the people that stopped by to talk had as their first question “But this is just single precision, right?” I would then spend the next 5 min trying to convince them (with quantitative data) that single precision is plenty good enough for Molecular Dynamics. Most of them remained unconvinced because in their minds, double precision is a magic bullet that solves everything.

Too few people understand that even double precision has its problems, especially for iterative processes like the ones HOOMD uses. See …view/reals.html for some nice examples.

I look forward to the double precision hardware so that I can implement HOOMD on it just to make this 1/2 of the physics community happy, even as I run my own research simulations in single precision ;)

Even knowing better, I made this mistake as well. An early version of my code used pseudo-double precision to accumulate a sum of a large number of floats. After 8 months, I finally tried single precision Kahan summation (thanks Simon Green for the suggestion to someone else!), and found it was enormously faster and worked just as well for my application. Being smart with single precision can be a lot more productive than being dumb with double precision. :)

mfactica, I apologize for being tedious, but are you saying that double precision will be natively supported in GeForce cards (as opposed to emulated)? If it is native, could you help me understand the distinction that Kirk is making between the ‘HPC space’ and the ‘consumer space’ (4th and 5th paragraphs on the second page of the interview):

“Consumers don’t need double precision,” he said. “It’s mainly going to be used in the HPC space to improve computational speed. As a result, we’re going to need to make the Tesla customers pay for that silicon on every chip until there is demand for it in the consumer space. That demand is kind of hard to imagine at the moment though.

“I can’t predict the future because I don’t know, but I would imagine that double precision will be supported across all products.”

Just a quick question: did you implement this summation in a reduction-type kernel? I have a kernel which does a big reduction at the end, and my results could use some more accuracy, so I was thinking about using Kahan, but I have the idea it is non-trivial to implement in a reduction. I haven’t yet looked at it in detail; it’s on the TODO list that just keeps growing faster & faster ;)

No, it’s not used in a reduction in my kernel, but in some thread-local accumulators. So in that case it was pretty easy.

OK, that is indeed not too hard, and with memory-bound kernels it will usually not take any extra time (except when the extra registers push you just across an occupancy boundary).

I can’t find a copy of this article for free anywhere at our university, but if you can get a copy, the abstract sounds like exactly what you want.

Ah, thanks! I’ll try through the university here next week; it sounds like exactly what I need.

Okay, if anyone is interested, I can give a small example of how the algorithm works. It basically keeps a q-array next to your sum array and does a very Kahan-like summation step, taking two q’s in the first Kahan step instead of one.