The double-precision performance of GTX/RTX GPUs can be very weak (for example the GTX 960 or RTX 2080 Ti?). I want to improve double-precision performance on my GTX GPU (it is the only GPU I have). How can I solve this problem?

Or can I compute on double-precision data using single-precision operations, without losing precision? Thank you!

Double-precision computation requires double-precision arithmetic units in the GPU. There is only a small number of those in a consumer GPU, and you can’t change that number. There are no hidden units you could magically unlock with some trick. Typically the double-precision units of high-end consumer GPUs provide several hundred GFLOPS of throughput.

There are techniques you could try that use pairs of ‘float’ numbers to simulate double precision (after a fashion), providing almost the same precision, but limiting you to the same limited range as ‘float’. See this old answer of mine on Stack Overflow for references: https://stackoverflow.com/a/6770329/780717
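To make the paired-float idea concrete, here is a minimal host-side C++ sketch, not a tuned or complete implementation. The type `dfloat` and the `df_*` helpers are hypothetical names invented for this illustration. It assumes IEEE-754 `float` arithmetic with a correctly rounded `std::fma`, and it must not be compiled with fast-math style options, which can optimize the error terms away. The same arithmetic should carry over to device code, but that is untested here.

```cpp
#include <cmath>

// Hypothetical "double-float" value: represents hi + lo, with |lo| much smaller than |hi|.
struct dfloat { float hi; float lo; };

// Split a double into a float pair (only valid inside the range of 'float').
inline dfloat df_from_double(double x) {
    float hi = static_cast<float>(x);
    float lo = static_cast<float>(x - static_cast<double>(hi));
    return {hi, lo};
}

// Recombine the pair into a double.
inline double df_to_double(dfloat x) {
    return static_cast<double>(x.hi) + static_cast<double>(x.lo);
}

// Error-free sum (Knuth's TwoSum): result.hi + result.lo == a + b exactly.
inline dfloat two_sum(float a, float b) {
    float s  = a + b;
    float bb = s - a;
    float e  = (a - (s - bb)) + (b - bb);
    return {s, e};
}

// Error-free product via FMA: result.hi + result.lo == a * b (barring overflow/underflow).
inline dfloat two_prod(float a, float b) {
    float p = a * b;
    float e = std::fma(a, b, -p);
    return {p, e};
}

// Approximate double-float addition, followed by renormalization.
inline dfloat df_add(dfloat a, dfloat b) {
    dfloat s = two_sum(a.hi, b.hi);
    float lo = s.lo + (a.lo + b.lo);
    float hi = s.hi + lo;
    lo = lo - (hi - s.hi);
    return {hi, lo};
}

// Approximate double-float multiplication (the tiny a.lo * b.lo term is dropped).
inline dfloat df_mul(dfloat a, dfloat b) {
    dfloat p = two_prod(a.hi, b.hi);
    float lo = p.lo + (a.hi * b.lo + a.lo * b.hi);
    float hi = p.hi + lo;
    lo = lo - (hi - p.hi);
    return {hi, lo};
}

// Example expression a*a + b*b + a*b evaluated in double-float arithmetic.
inline dfloat df_test(dfloat a, dfloat b) {
    return df_add(df_add(df_mul(a, a), df_mul(b, b)), df_mul(a, b));
}
```

Each double-float operation costs many single-precision instructions, and the result carries roughly 48 significand bits rather than the 53 of ‘double’, so this only pays off where single-precision throughput is far higher than double-precision throughput.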

There might be a ready-to-use software library somewhere that implements this double-float floating-point format but I am not aware of one at this time.

I may not have expressed myself clearly above.

I want to evaluate a function like the one below in a kernel on my GPU (this is just an example). Since the double-precision performance of GTX/RTX GPUs can be very weak, I want to speed up the calculation. Is there anything I can do? For instance, can I compute on double-precision data using single-precision operations?

P.S. The parameters are double precision.

```
double test(double a, double b) {
    return a * a + b * b + a * b;
}
```

No, there isn’t any simple way to do that.

You either get the throughput available for double precision on your GPU, or you use a library or a set of helper functions that breaks the double-precision calculations into single-precision operations. There is no standard or simple way to do that, and in practice almost nobody does it.

What is actually pretty common, however, is for people to figure out how to make their algorithm work with just single precision.

```
float test(float a, float b) {
    return a * a + b * b + a * b;
}
```

You cannot do that faster than the native double-precision functional units in the GPU if you want the *exact* same results.

You can likely get *approximately* the same results using computation on paired-float operands; see the second paragraph of my previous post. And that *may* be faster. Note that your computation will have to be substantially more complex than what you show in your example to reap performance benefits, since you will incur conversion overhead (from/to ‘double’) at the start and end of the computation.

As Robert Crovella points out, in various use cases there are ways to use single-precision computation instead of double precision, or single-precision computation for the bulk of the processing followed by double-precision refinement/cleanup (so-called “mixed-precision computation”; you might want to Google for it).
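As one small illustration of the mixed-precision pattern, the sketch below computes a reciprocal by taking a cheap single-precision estimate and then polishing it with two Newton-Raphson steps in double precision. The function name `reciprocal_mixed` is hypothetical, and the sketch assumes `x` is nonzero and lies within the range of ‘float’. Real mixed-precision codes (for example iterative refinement of linear solves) are considerably more involved.

```cpp
// Mixed-precision reciprocal: float estimate, double-precision refinement.
// Assumption: x is nonzero and representable (without overflow) as a float.
inline double reciprocal_mixed(double x) {
    // Bulk work in single precision: roughly 7 correct decimal digits.
    double r = static_cast<double>(1.0f / static_cast<float>(x));
    // Each Newton step r <- r * (2 - x * r) roughly doubles the number of correct digits.
    r = r * (2.0 - x * r);
    r = r * (2.0 - x * r);
    return r;
}
```

The refinement uses only double-precision multiplies and subtractions, which is the point of the pattern: the expensive part runs in single precision and the double-precision work is kept short.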

thank you!


Sorry to hijack the topic, but when I check functions in NVVP, the shared memory report often suggests using double precision to achieve twice the bandwidth.

I don’t know how much it relates to what the OP discussed, but maybe someone can enlighten me on how it works exactly or if it actually happens? Because if I don’t need the precision and float works fine result-wise, it will double the bandwidth (???) at the cost of using twice as much SHMEM, potentially limiting the number of blocks that could be active at a time.

This refers to a Kepler-specific feature. If you set 8-byte bank mode on Kepler, the achievable bandwidth to shared memory can double for 8-byte accesses such as ‘double’. If you search around on kepler shared eight-byte-mode you’ll find various writeups.

Yes, I remember this feature. However, the call that sets 8-byte mode has no effect on Maxwell and later, if I am not mistaken.

Does it mean that for Maxwell onwards it is already doing something under the hood and providing the benefit?

I recommend reading the white paper in the CUDA documentation called “Floating Point and IEEE 754” and also Goldberg’s paper, which is reference number 5 in the white paper. Goldberg’s paper lists a few algorithms that can be used to retain accuracy and keep low-order bits from being rounded away. The papers also discuss the fused multiply-add instruction, which might be helpful to you.
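One of the algorithms covered in Goldberg’s paper is Kahan (compensated) summation. The sketch below is a plain host-side C++ version next to a naive sum for comparison; the function names are chosen for this illustration only, and the code assumes the compiler does not reassociate floating-point operations (so no fast-math options).

```cpp
#include <vector>

// Straightforward single-precision accumulation.
inline float naive_sum(const std::vector<float>& v) {
    float sum = 0.0f;
    for (float x : v) sum += x;
    return sum;
}

// Kahan summation: carries a correction term for the bits lost in each addition.
inline float kahan_sum(const std::vector<float>& v) {
    float sum = 0.0f;
    float c = 0.0f;              // running compensation
    for (float x : v) {
        float y = x - c;         // apply the correction to the incoming value
        float t = sum + y;       // low-order bits of y may be lost here
        c = (t - sum) - y;       // recover what was lost (negated)
        sum = t;
    }
    return sum;
}
```

Summing one million copies of `0.1f` this way stays very close to 100000, while the naive single-precision loop drifts visibly, which shows how much accuracy can be kept without switching to ‘double’.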


Just to answer my own question, the last paragraph of section 1.4.5.2 of the Pascal Tuning Guide states:

“To simplify this, Pascal follows Maxwell in returning to fixed four-byte banks. This allows, all applications using shared memory to benefit from the higher bandwidth, without specifying any particular preference via the API.”