The effect of thread block size on result accuracy

Hello all,

I would like to know if anyone can explain why reducing the number of threads per block in a kernel reduces the solution accuracy. I have ported an implicit solver in a high-end unstructured CFD code in which cache effects dominate performance. Because of this, reducing the thread block size to as low as 8 gives the fastest solution time, but unfortunately the solution then fails to converge to the preset tolerance of 1.0E-12. Here are some scenarios with the minimum thread block sizes required for convergence, all run with identical software; the CUDA portion of the code uses mixed precision:

Case 1: Dual Intel Xeon dual-core workstation
1 core, 1 GTX 480 needs min 128 threads / block to converge
2 cores, 1 GTX 480 needs min 32 threads / block to converge
4 cores, 1 GTX 480 needs min 64 threads / block to converge

Case 2: AMD Athlon II quad-core Beowulf cluster
1 core, 1 GTX 470 will not converge for any thread block size
2 cores, 1 GTX 470 needs min 32 threads / block
2 cores, 2 GTX 470 needs min 32 threads / block
4 cores, 1 GTX 470 needs min 64 threads / block
4 cores, 2 GTX 470 needs min 64 threads / block
8 cores, 2 GTX 470 will not converge for any thread block size

In the cases where the thread block size is below the minimum (or where no block size converges), the solution converges only to a reduced tolerance, usually between 1.0E-7 and 1.0E-8. When it does converge to 1.0E-12, it produces identical numbers for all thread block sizes above the minimum, which is the behavior I would expect. Node distributions are balanced well across processors, and this code scales very well to thousands of cores, so there are no issues there. Can anyone provide some insight into this problem - why does the thread block size have an effect on result accuracy?
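
To illustrate what I mean by the block size entering the arithmetic, here is a minimal sketch of a mixed-precision, block-level residual reduction (placeholder names, not my actual solver code). Each block produces one partial sum over blockDim.x entries, so changing the block size changes the grouping, and therefore the rounding, of the floating-point additions.

// Sketch only: one double partial sum per block from single-precision entries.
__global__ void residual_partial_sums(const float *res, int n, double *block_sums)
{
    extern __shared__ double sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Promote the single-precision residual entries to double before summing
    // (the mixed-precision part); guard the tail so out-of-range threads add 0.
    sdata[tid] = (i < n) ? (double)res[i] * (double)res[i] : 0.0;
    __syncthreads();

    // Standard tree reduction within the block (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial sum per block; these are summed afterwards, so the grouping
    // of the additions depends on the block and grid dimensions.
    if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];
}

This would be launched as residual_partial_sums<<<blocks, threads, threads * sizeof(double)>>>(d_res, n, d_block_sums), with the per-block partials summed afterwards on the host or in a second kernel.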

Problem solved - I launched more blocks than should be necessary. I don't know why I should have to do this, but it works.
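
In case it helps anyone hitting the same thing: one common reason extra blocks are needed is that the grid size is computed with truncating integer division, so the last partial block of elements is never launched. Below is a minimal sketch of a launch that rounds the block count up and bounds-checks inside the kernel (placeholder names, not my actual solver code).

// Sketch only: round the grid size up and guard against out-of-range threads.
__global__ void update_nodes(float *x, const float *dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // extra threads and blocks simply do nothing
        x[i] += dx[i];
}

void launch_update(float *d_x, const float *d_dx, int n, int threadsPerBlock)
{
    // Ceiling division: with plain n / threadsPerBlock the remainder elements
    // (up to threadsPerBlock - 1 of them) would never be updated.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    update_nodes<<<blocks, threadsPerBlock>>>(d_x, d_dx, n);
}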