I have a question about double-precision (DP) math in CUDA.
My kernel produces an array of 20 million doubles. When I change the number of threads per block, I get slightly different results than with another threads-per-block setup.
I dumped the buffers for 128 and 256 threads/block to binary files and compared them. There are differences: a few thousand bytes out of 160 MB.
When I dumped the doubles as strings with 10 decimal places, both files (128 and 256 threads/block) were identical. But when I dumped them with 16 decimal digits, differences appeared.
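To make the comparison concrete, here is a minimal Python sketch of how two such dumps could be checked element by element instead of byte by byte. It assumes the files are raw little-endian 64-bit doubles; the helper names (`bits`, `ulp_distance`, `count_mismatches`) and the sample values are made up for illustration and are not taken from my kernel. It also shows why 10 decimal places can hide a difference that 17 significant digits reveal.

```python
import struct

def bits(x: float) -> int:
    """Return the raw 64-bit IEEE-754 pattern of a double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def ulp_distance(a: float, b: float) -> int:
    """Distance in representable doubles between a and b.

    Only meaningful for finite values of the same sign.
    """
    return abs(bits(a) - bits(b))

def count_mismatches(buf_a: bytes, buf_b: bytes):
    """Compare two raw dumps of little-endian doubles element by element.

    Returns a list of (index, value_a, value_b, ulp_distance) tuples
    for every element whose bit pattern differs.
    """
    n = min(len(buf_a), len(buf_b)) // 8
    a = struct.unpack(f"<{n}d", buf_a[: n * 8])
    b = struct.unpack(f"<{n}d", buf_b[: n * 8])
    return [
        (i, x, y, ulp_distance(x, y))
        for i, (x, y) in enumerate(zip(a, b))
        if bits(x) != bits(y)
    ]

# Illustrative values only: two doubles that differ in the last bit.
x = 0.1 + 0.2
y = 0.3

print(f"{x:.10f}", f"{y:.10f}")   # same text at 10 decimal places
print(f"{x:.17g}", f"{y:.17g}")   # different text at 17 significant digits
print(ulp_distance(x, y))         # how many representable steps apart

# Stand-ins for the two dumped files (hypothetical contents).
buf_128 = struct.pack("<3d", 1.0, x, 2.0)
buf_256 = struct.pack("<3d", 1.0, y, 2.0)
for index, va, vb, ulps in count_mismatches(buf_128, buf_256):
    print(index, repr(va), repr(vb), ulps)
```

For the real 160 MB dumps, the two buffers would come from reading the files in binary mode. Unpacking 20 million values with `struct` is memory-hungry, so an array library would likely be a better fit there. The ULP distance is useful because it tells you whether the mismatches are last-bit rounding differences or something larger.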
For a fixed threads/block setting, the results are identical from run to run, down to the last decimal place.
Since the calculations should be independent of the thread setup, I'm puzzled about what is causing these slight DP precision changes.
Do you have any idea what is causing this?
Here are some thread configurations and the resulting hashes: