As I understand it, -m32 vs. -m64 compilation affects the size of pointers, but the output of GPU calculations should not be affected. Is that true, or am I missing something? In my case the output of a kernel compiled with -m32 differs from the -m64 build, which is strange and unexpected behaviour.
I will try, but it is not that simple to extract the functionality … At this point the question is aimed at finding out whether there is something obvious to keep in mind when working with -m64 in order to get everything correct.
Are the results very different (suggesting some kind of sizeof()/alignment issue) or slightly different (suggesting some kind of float order of operations issue)?
I don’t know of a reason why -m64 would change the output on the GPU. On the CPU, some compilers use different internal registers and instructions (basically, switching from x87 to SSE) for floating point, which changes the rounding behavior. The GPU, however, should not do that. Can you look at the PTX from both cases and spot any obvious difference in the emitted code? (Yeah, that’s a huge pain and probably not illuminating if your kernel is long.)
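For the PTX comparison, something like the following should work (assuming your kernel lives in a file called `kernel.cu`; adjust the name to your project):

```shell
# Emit PTX for both pointer sizes and diff the results.
nvcc -m32 --ptx kernel.cu -o kernel32.ptx
nvcc -m64 --ptx kernel.cu -o kernel64.ptx
diff kernel32.ptx kernel64.ptx
```

Expect some benign differences (e.g. pointer widths in parameter declarations); what you are looking for is a difference in the actual arithmetic instructions.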
Finally, I have spotted the reason: it is not the GPU but, as you said, the CPU side. The CPU code compiled for x64 handles the output from the GPU differently. I know the CPU side is not the subject of this forum, but just in case I am giving the code that reproduces the issue below; maybe there will be some ideas.
There are compiler specific ways to control the floating point code generated for the CPU. I have no experience with MSVC, so I can’t suggest anything but some quality time with Google.
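For what it's worth, the usual MSVC knobs for this are the /fp and /arch switches (hedging here: whether they fully reconcile your 32-bit and 64-bit results depends on your code and compiler version). A 32-bit MSVC build may use x87 instructions by default, while x64 always uses SSE2, which is one common source of 32-bit vs. 64-bit divergence on the host:

```shell
rem Force SSE2 code generation in the 32-bit build so it matches x64,
rem and ask for predictable floating-point semantics in both.
cl /fp:precise /arch:SSE2 host32.cpp
cl /fp:precise host64.cpp
```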
As I understand it, there is no way to guarantee identical results (via compiler options, for example)? In my case the final result of a complex calculation can differ by up to 50% between the 32-bit and 64-bit builds of the application.
Well, those 50% deviations must be coming from some other bit of code than the one you listed above. One troubleshooting strategy you might employ is to output intermediate values during the computation and determine the point at which the values start to diverge significantly. This will help identify the root cause.
Another idea that comes to mind is that kernels compiled for 64-bit sometimes use additional registers, so the block size you have chosen may exceed the hardware's register resources and cause the launch to fail. To identify this situation, add an option to your application that performs the following error check after every single kernel call (make it optional, as it reduces performance).
cudaDeviceSynchronize();   // cudaThreadSynchronize() in older toolkits; now deprecated
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));