Hi,
I have source code that solves the 2-dimensional shallow water equations using C++ and CUDA 11.2.
I have tested the exe file on GTX970, GTX1080Ti, RTX2070, RTX2080, and RTX2080Super cards.
The maximum differences between the GPU and CPU calculation results are as follows:
GPU ====> Max. value of calculation result difference with CPU
GTX970 ====> 0.001
GTX1080Ti ====> 0.001
RTX2070 ====> 0.204
RTX2080 ====> 0.261
RTX2080Super ====> 0.221
My question is: why do the RTX cards show a bigger difference than the GTX cards?
Are there any build options specific to RTX?
I use Visual Studio 2019, CUDA Toolkit 11.2, and C++.
I built with sm_52 and compute_52 (or sm_75 and compute_75), and with the “Use Fast Math: false” option.
My PC runs Windows 10 64-bit.
Please help me.
“calculation difference with CPU”
This is not a valid measure of correctness or accuracy.
Since code generation has an architecture-dependent component, on both CPUs and GPUs, numerical differences can result. For example, the compiler may re-arrange code to increase instruction-level parallelism, or apply fusion of FMUL and dependent FADD into an FMA (fused multiply-add) differently, or some math functions may have slightly different accuracy within the stated accuracy bounds.
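As a minimal illustration of the FMA point, consider the sketch below (the operand values are chosen only to expose the difference, and the kernel name is made up). The contracted FMA rounds once, while the separate multiply and add round twice, so the two results can differ in the last bits:

```cpp
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// Sketch only: shows how FMA contraction alone can change the last bits of a result.
// __fmaf_rn rounds once (fused multiply-add); __fmul_rn/__fadd_rn round twice.
__global__ void fma_vs_separate(float a, float b, float c, float *out)
{
    out[0] = __fmaf_rn(a, b, c);              // fused: single rounding
    out[1] = __fadd_rn(__fmul_rn(a, b), c);   // unfused: round after multiply, then after add
}

int main()
{
    float *d_out, h_out[2];
    cudaMalloc(&d_out, 2 * sizeof(float));
    // (1+eps)*(1-eps) = 1 - eps^2: the fused path keeps the tiny -eps^2,
    // the unfused path rounds the product to 1.0 and yields exactly 0.
    fma_vs_separate<<<1, 1>>>(1.0f + FLT_EPSILON, 1.0f - FLT_EPSILON, -1.0f, d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("fused = %.10e  unfused = %.10e\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}
```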
If you want to assess accuracy, the easiest (but not entirely foolproof) way is to compute reference results in (much) higher precision, then compare against that.
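For instance, something along these lines, where the array names are just placeholders for your solution fields and the double-precision reference would come from a separate high-precision run:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Sketch: judge both the CPU-float and the GPU-float results against a double
// (or higher) precision reference, instead of comparing them to each other.
void report_errors(const std::vector<float>  &h_cpu,
                   const std::vector<float>  &h_gpu,
                   const std::vector<double> &h_ref)
{
    double max_err_cpu = 0.0, max_err_gpu = 0.0;
    for (size_t i = 0; i < h_ref.size(); ++i) {
        max_err_cpu = std::max(max_err_cpu, std::fabs((double)h_cpu[i] - h_ref[i]));
        max_err_gpu = std::max(max_err_gpu, std::fabs((double)h_gpu[i] - h_ref[i]));
    }
    printf("max |CPU - ref| = %g   max |GPU - ref| = %g\n", max_err_cpu, max_err_gpu);
}
```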
If you merely need the issue to go away, try compiling with -fmad=false. Note: this will in all likelihood decrease both the performance and the accuracy of the GPU implementation.
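For example, on the nvcc command line (the file name below is just a placeholder; in Visual Studio the flag can be added to the additional options of the CUDA C/C++ project settings):

```
nvcc -fmad=false -o swe2d swe2d.cu
```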
Thanks njuffa,
You are right, it is not about accuracy.
So I have changed the title of this issue.
Previous title : RTX shows lower floating point accuracy than GTX?
New title: RTX shows bigger calculation difference with CPU than GTX?
I think it’s unlikely that anyone can give you a complete answer unless either you provide a complete code example, or someone decides to write a lengthy tutorial for you.
The only “build options” specific to RTX that come to mind are the ones you already mentioned: it’s always good practice to compile for the architecture(s) you intend to run on, e.g. compute_75/sm_75 for a cc7.5 GPU.
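For example, a fatbinary covering both of the architectures you listed can be built roughly like this (the file name is a placeholder; the Visual Studio “Code Generation” setting produces equivalent options):

```
nvcc -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_75,code=sm_75 -o swe2d swe2d.cu
```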
Here is a brief tutorial. I would say there are two general categories of issues here.
- Those arising from machine calculation methods. For example, comparing float to double, or computing with or without FMA contraction, are a couple of examples. Such differences can easily give rise to differences in results, whether we are talking about comparing CPU to GPU or some other comparison. In addition, more complex operations such as sin(), cos(), and other “library” math functions may simply be implemented differently on two different “machines”, giving rise to different results.
- Those arising from order of operations (algorithm calculation methods). Floating-point operations don’t always have all the characteristics of the basic math operations that we learned about in grade school/middle school. If you compare a serial algorithm/realization to a parallel algorithm/realization, it’s often the case that the math doesn’t get done in exactly the same order. This can give rise to differences; the short summation sketch after this list illustrates the idea. A possible item to read here is this floating point whitepaper
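Here is a minimal host-only sketch of the order-of-operations point: the same set of floats is summed left-to-right and then in per-chunk partial sums (roughly the grouping a per-block parallel reduction produces). The data and chunk size are arbitrary; the two totals typically differ in the last bits.

```cpp
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> v(1 << 20);
    for (size_t i = 0; i < v.size(); ++i)
        v[i] = 1.0f / (float)(i + 1);          // arbitrary test data

    // Serial, left-to-right accumulation (typical CPU loop)
    float serial = 0.0f;
    for (size_t i = 0; i < v.size(); ++i)
        serial += v[i];

    // Chunked accumulation: partial sums per "block", then a sum of the partials
    const size_t chunk = 256;
    float chunked = 0.0f;
    for (size_t b = 0; b < v.size(); b += chunk) {
        float partial = 0.0f;
        for (size_t i = b; i < b + chunk && i < v.size(); ++i)
            partial += v[i];
        chunked += partial;
    }

    printf("serial = %.8f  chunked = %.8f  diff = %g\n",
           serial, chunked, serial - chunked);
    return 0;
}
```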
Because the GPUs you mention generally have different sizes (differing numbers of SMs, for example), a nice ninja-tuned parallel reduction that scopes out the size of the GPU being run on, uses e.g. a grid-stride loop to size the grid to match the GPU, and then does a parallel reduction, will likely give at least slightly different results depending on the GPU it is run on.
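A sketch of that pattern is below (the kernel and sizes are illustrative, not code from any particular application). Because the grid is sized from the SM count of whatever GPU it runs on, the grouping of the additions, and therefore the last bits of the result, can vary from GPU to GPU.

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop feeding a per-block shared-memory reduction into partial sums.
__global__ void reduce_sum(const float *in, float *block_sums, int n)
{
    extern __shared__ float s[];
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)               // grid-stride loop
        local += in[i];
    s[threadIdx.x] = local;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = s[0];
}

int main()
{
    const int n = 1 << 22, tpb = 256;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int blocks = prop.multiProcessorCount * 4; // grid sized to the GPU at hand

    std::vector<float> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f / (float)(i + 1);

    float *d_in, *d_part;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_part, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduce_sum<<<blocks, tpb, tpb * sizeof(float)>>>(d_in, d_part, n);

    std::vector<float> h_part(blocks);
    cudaMemcpy(h_part.data(), d_part, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += h_part[b]; // final pass on the host

    printf("blocks = %d  sum = %.8f\n", blocks, total);
    cudaFree(d_in); cudaFree(d_part);
    return 0;
}
```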
I doubt that is actually the issue in your case; I don’t know what the issue is in your case. You’re welcome to discuss it further, but I’m unlikely to provide any further response unless you provide a short, complete example that demonstrates the issue.
There are many questions on various forums that fit this general description (CPU/GPU differences), here is one example. Yes, I’m aware that one doesn’t specifically focus on GPU-GPU differences. Here is one that discusses GPU-GPU results differences, in a general way. I’m sure you can find others.
Thank you, Robert_Crovella
I will review my code carefully in light of your comments.
If I have any positive results, I will let you know.
Thank you, Robert_Crovella.
I have reviewed my code in detail, and it seems I have found the solution.
The summary of my review is as follows.
- The __syncthreads() function behaved more strictly on RTX than on GTX.
(e.g. at a __syncthreads() inside an “if” statement, the RTX hung but the GTX did not; see the sketch after this list.)
- The calculation results were more sensitive to the threads-per-block (TPB) value on RTX than on GTX.
When the TPB was changed, the calculation results from the GTX changed less than those from the RTX.
(** My calculation includes an iteration process (implicit method).)
- When I applied a larger TPB value (e.g. 32, 64, 128, 256, 512), the calculation results from the RTX became closer to the CPU results. I think this may be because a larger TPB can sync more control volumes within a block, even though cudaDeviceSynchronize() is not applied.
(** The control volume is the minimum calculation unit when using the finite volume method (FVM).)
- When I applied a TPB of 256 or larger to the RTXs and 32 to the GTXs, the calculation results from all the GPUs were very close to the CPU results.
(** In my first posting of May 27, TPB 32 was applied to all RTXs and GTXs.)
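A minimal sketch of the __syncthreads() pattern I mean is below (the kernels and names are only illustrative, not my actual code). If the condition is not uniform across the whole block, some threads skip the barrier, which is undefined behaviour: it may appear to work on one architecture and hang on another.

```cpp
// Problematic: not all threads of the last block take this branch,
// so only some of them arrive at the __syncthreads().
__global__ void bad_pattern(float *h, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        h[i] += 1.0f;
        __syncthreads();     // undefined: barrier inside a non-uniform branch
        h[i] *= 0.5f;
    }
}

// Safer rewrite: every thread of the block reaches the barrier.
__global__ void safe_pattern(float *h, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) h[i] += 1.0f;
    __syncthreads();         // uniform: all threads call it
    if (i < n) h[i] *= 0.5f;
}
```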
Again, thank you for your comments and encouragement.