I am working with HPC programs in C++ using OpenACC. I have a Ryzen 7 2700X with a GeForce GTX 1660 card. For my particular application, the bulk of the computation consists of sampling and updating arrays (either 50x50 or 100x100; the conclusions are the same), and for those I have #pragma directives. The arrays live on the device and do not move back and forth. According to the specs, the card can provide up to a factor of 32 better performance using single precision than double precision. But in my case the performance numbers are essentially identical between single and double, and are in fact slower than running on the CPU with 16 OpenMP threads. The double version probably should be slower than the CPU, but the single-precision results I don't understand. I've spent considerable time trying to find bottlenecks, latency issues, etc., with no success so far. So instead, I'd like to ask if there is some sample code in a repository somewhere which, when run with OpenACC, will illustrate the single-precision speedup. It would be easier for me to look at that code and see where mine differs in structure.
I’m not aware of any, but I typically use data center devices where the double precision performance isn’t an issue.
Have you used Nsight-Compute to see the time and instruction count on the FP units? The floating-point work could be a fairly small fraction of the total, which would explain why precision doesn't seem to matter. NCU might also be able to give you more clues about the performance.
My first guess would be that the program is under-utilizing the GPU. While I don't know the details of a GTX 1660, I believe it has 22 SMs, with each SM capable of running a maximum of 2048 threads. So a problem size of 100x100 is probably using less than a quarter of the GPU.
Granted I don't know your code, but that's where I'd look first.
-Mat
Thanks for the input. I had gotten the GPU capacity wrong. When I increase 100x100 to 1000x1000 and 5000x5000, there is a big speedup over the CPU, and the double-precision code takes about 50% longer than single. So there's a trend, but not a factor of 32. I don't know what sort of code inside loops gives the best performance factor with GPUs.
Again, I don't know the details of a GTX 1660, but the factor of 32 most likely refers to the throughput of instructions executed on the floating-point units, i.e. the ratio of single- to double-precision FP units available on the device, not the overall application performance.
Keep in mind with double precision, you’re also doubling the amount of data, so memory is also a factor.
Nsight-Compute is your best tool here. You can profile both the single- and double-precision versions, set one as the baseline, and then directly compare to see where the differences are.