I’m in the process of porting some existing C code to CUDA and developing on a Tesla C1060. I’m not sure if the problem I’m having is the one I think I’m having, but it appears to me that making use of shared double arrays causes some loss of precision.
I can’t post my entire kernel, but it works if it looks something like this:
double array[25]; ... array[a * 5 + b] = /* some calculations */
Each thread having a private array of 25 doubles is obviously a terrible idea, since the compiler spills such arrays into local memory, so I’m attempting to give each thread its 25 elements through shared memory instead. For debugging purposes, I am running my kernel with only 1 block of 1 thread, so I feel that the following should work:
__shared__ double array[25]; ... array[a * 5 + b] = /* the same calculations */
However, the end result of my program now comes out looking like a rounding error crept in somewhere. This is the only change I’ve made, so I’m at a loss as to what could be going wrong.
Does anybody have any ideas?
EDIT: Through trial and error, I’ve discovered that my kernel produces the correct results when indices are calculated using b * 5 + a instead; I have absolutely no idea why this is the case.