Hello all,
I would like to know if anyone can explain why reducing the number of threads per block in a kernel reduces the solution accuracy. I have ported an implicit solver to the GPU in a high-end unstructured CFD code whose performance is cache-dominated. For this reason, reducing the thread block size to as low as 8 provides the fastest solution time, but unfortunately the solver then does not converge to the preset tolerance of 1.0E-12. Here are some scenarios with the minimum thread block sizes required for convergence, using identical software; the CUDA portion of the code uses mixed precision:
Case 1: Dual Intel Xeon dual-core workstation
1 core, 1 GTX 480 needs min 128 threads / block to converge
2 cores, 1 GTX 480 needs min 32 threads / block to converge
4 cores, 1 GTX 480 needs min 64 threads / block to converge
Case 2: AMD Athlon II quad-core Beowulf cluster
1 core, 1 GTX 470 will not converge for any thread block size
2 cores, 1 GTX 470 needs min 32 threads / block
2 cores, 2 GTX 470 needs min 32 threads / block
4 cores, 1 GTX 470 needs min 64 threads / block
4 cores, 2 GTX 470 needs min 64 threads / block
8 cores, 2 GTX 470 will not converge for any thread block size
In the cases where the thread block size is below the minimum (or where no size converges), the solution converges only to a reduced tolerance, usually between 1.0E-7 and 1.0E-8. When it does converge to 1.0E-12, it produces identical numbers for all thread block sizes above the minimum, which is the behavior I would expect. Node distributions are balanced well across processors, and this code scales very well to thousands of cores, so I don't believe load balancing is the issue. Can anyone provide some insight into this problem - why does the thread block size have an effect on result accuracy?