Hello,
I have a question about double-precision (DP) math in CUDA.
My kernel produces an array of 20 million doubles. When I change the number of threads per block, I get slightly different results than with other threads-per-block settings.
I dumped the output buffers for 128 threads/block and 256 threads/block to binary files and compared them. They differ in a few thousand bytes out of 160 MB.
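For anyone who wants to quantify such differences rather than just count bytes, here is a minimal host-side C++ sketch that measures how far apart two buffers are in units in the last place (ULPs). The names `ordered_bits`, `ulp_distance`, `DiffStats`, and `compare_buffers` are hypothetical helpers made up for this illustration, not part of CUDA or of my actual code, and the sketch assumes both dumps have already been loaded into vectors of equal length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Map a double's bit pattern to a signed integer that increases with the
// double's value, so the gap between two keys is their distance in ULPs.
// NaNs are not treated specially in this sketch.
inline std::int64_t ordered_bits(double x) {
    std::int64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    // Negative doubles are sign-magnitude; flip them into ascending order.
    return bits < 0 ? std::numeric_limits<std::int64_t>::min() - bits : bits;
}

// Number of representable doubles between a and b (0 means equal in value).
inline std::uint64_t ulp_distance(double a, double b) {
    const std::int64_t ka = ordered_bits(a);
    const std::int64_t kb = ordered_bits(b);
    // Unsigned subtraction avoids signed overflow for values of opposite sign.
    return ka > kb
        ? static_cast<std::uint64_t>(ka) - static_cast<std::uint64_t>(kb)
        : static_cast<std::uint64_t>(kb) - static_cast<std::uint64_t>(ka);
}

struct DiffStats {
    std::size_t differing = 0;   // elements with a nonzero ULP distance
    std::uint64_t max_ulps = 0;  // largest ULP distance seen
};

// Compare two result buffers element by element.
inline DiffStats compare_buffers(const std::vector<double>& a,
                                 const std::vector<double>& b) {
    assert(a.size() == b.size());
    DiffStats stats;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::uint64_t d = ulp_distance(a[i], b[i]);
        if (d != 0) {
            ++stats.differing;
            if (d > stats.max_ulps) stats.max_ulps = d;
        }
    }
    return stats;
}
```

If `max_ulps` comes back as 1 or 2, the two runs differ only in the last bits of the mantissa. Larger values would point to something more than rounding in the final bits.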
When I wrote the doubles out as text with 10 decimal places, the two files (128 threads/block and 256 threads/block) were identical. With 16 decimal places, differences appeared.
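To illustrate why the text dumps can agree at 10 decimal places and disagree at 16, here is a small C++ sketch. The helper name `format_fixed` is hypothetical, and the two sample values are made up for the illustration (they differ by exactly one unit in the last place near 1.0); they are not taken from my actual data.

```cpp
#include <cstdio>
#include <string>

// Format a double with a fixed number of digits after the decimal point.
inline std::string format_fixed(double x, int decimals) {
    char buf[512];
    std::snprintf(buf, sizeof buf, "%.*f", decimals, x);
    return std::string(buf);
}

// Two illustrative values one ULP apart (assumed sample data):
//   double a = 1.0;
//   double b = 1.0 + 2.220446049250313e-16;
// format_fixed(a, 10) and format_fixed(b, 10) give the same string,
// while format_fixed(a, 16) and format_fixed(b, 16) differ.
```

A double carries about 15 to 17 significant decimal digits, so a 10-decimal dump hides any difference confined to the last few bits. Printing with 17 significant digits (`%.17g`) is enough to distinguish any two different doubles.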
For a fixed number of threads per block, the results are identical down to the last decimal place, regardless of the number of blocks.
Since the calculations are independent of the thread configuration, I'm puzzled about what causes these slight changes in DP precision.
Do you have any idea what is causing this?
Here are some launch configurations (blocks x threads per block) and the hashes of the resulting buffers:
32 threads/block
[*]4096x32 9a8245b0899750daa22d2523331c3d72cc0f7350
[*]8192x32 9a8245b0899750daa22d2523331c3d72cc0f7350
64 threads/block
[*]1024x64 d14316cd1a5f53f3b07041b2bb03b684fbb58aee
[*]8192x64 d14316cd1a5f53f3b07041b2bb03b684fbb58aee
128 threads/block
[*]2048x128 12ce58c340077083deb0544665ed37581c02f76a
[*]4096x128 12ce58c340077083deb0544665ed37581c02f76a
[*]8192x128 12ce58c340077083deb0544665ed37581c02f76a
256 threads/block
[*]1024x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46
[*]2048x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46
[*]4096x256 3dbac7ae03cbdbaf134a922be07c7a22c8e00e46
Many thanks,
Mirko