I am working with HPC programs in C++ using OpenACC. I have a Ryzen 7 2700X with a GeForce GTX 1660 card. For my particular application, the bulk of the computation consists of sampling and updating arrays (either 50x50 or 100x100; the conclusions are the same), and for those I have #pragma directives. The arrays live on the device and do not move back and forth. According to the specs, the card can provide up to a factor of 32 better performance using single precision than double precision. But in my case the performance numbers are essentially identical between single and double, and are in fact slower than running on the CPU with 16 OpenMP threads. The double version probably should be slower than the CPU, but the single-precision results I don't understand. I've spent considerable time trying to find bottlenecks, latency issues, etc., with no success so far. So instead, I'd like to ask if there is some sample code in a repository somewhere which, when run with OpenACC, will illustrate the single-precision speedup. It would be easier for me to look at that code and see where mine differs in structure.
I’m not aware of any, but I typically use data center devices where the double precision performance isn’t an issue.
Have you used Nsight-Compute to see the time and instruction count on the FP units? The floating-point work could be a fairly small fraction of the total, which would explain why precision doesn't seem to matter. NCU might also be able to give you more clues about the performance.
My first guess would be that the program is under-utilizing the GPU. While I don't know the details of a GTX 1660, I believe it has 22 SMs, with each SM capable of running a maximum of 2048 threads. So a problem size of 100x100 is probably using less than a quarter of the GPU.
Granted I don't know your code, but that's where I'd look first.
-Mat
Thanks for the input. I had gotten the GPU capacity wrong. When I increase 100x100 to 1000x1000 and 5000x5000, there is a big speedup over the CPU, and the double-precision code takes about 50% longer than single. So there's a trend, but not a factor of 32. I don't know what sort of code inside loops gives the best performance factor with GPUs.
Again, I don't know the details of a GTX 1660, but the factor of 32 most likely refers to the throughput of instructions executed on the floating-point units, i.e. the ratio of single- to double-precision FP units available on the device, not the overall application performance.
Keep in mind with double precision, you’re also doubling the amount of data, so memory is also a factor.
Nsight-Compute is your best tool here. You can profile both the single- and double-precision versions, set one as the baseline, and then directly compare to see where the differences are.