-m32 and -m64 give different calculation results. What can be the reason?

Hi All,

As I understand it, compiling with -m32 or -m64 affects the size of pointers, but the output of GPU calculations should not be affected. Is that true, or am I missing something? In my case the output of a kernel compiled with -m32 differs from the output of the same kernel compiled with -m64, which is strange and unexpected.

Thanks in advance.

Can you narrow it down to a simple repro case that can be posted here?

I will try, although it's not that simple to extract the functionality… At this point the question is whether there is something obvious to keep in mind when working with -m64 in order to get correct results.

Are the results very different (suggesting some kind of sizeof()/alignment issue) or slightly different (suggesting some kind of float order of operations issue)?

I don’t know of a reason why -m64 would change the output on the GPU. On the CPU, some compilers use different internal registers and instructions (basically, switching from x87 to SSE) for floating point, which changes the rounding behavior. The GPU, however, should not do that. Can you look at the PTX from both cases and spot any obvious difference in the emitted code? (Yeah, that’s a huge pain and probably not illuminating if your kernel is long.)

I have finally found the cause: it is not the GPU but, as you said, the CPU side. CPU code compiled for x64 handles the output from the GPU differently. I know the CPU side is not the subject of this forum but, just in case, here is code that reproduces the issue; maybe someone will have ideas.

float fConst = 1.4318620f;

float fValue1 = 40.598053f * (1.f - 1.4318620f / 100.f);

float fValue2 = 40.598053f * (1.f - fConst / 100.f);

MSVC 32

/fp:precise: fValue1 = 40.016743, fValue2 = 40.016747

MSVC 64

/fp:precise: fValue1 = 40.016743, fValue2 = 40.016743

The problem is that fValue2 is different. Is there a way to make it the same on both 32-bit and 64-bit platforms?

Thanks in advance.
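One possible explanation for the numbers above can be sketched in C++. If one build rounds every intermediate to float while the other keeps intermediates in a wider format and rounds only once at the end, the final result can land on a neighbouring float. The function names below are hypothetical, and the double-precision path is only an approximation of what x87 code generation might do; it is not taken from the actual MSVC output.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: every intermediate is rounded to float,
// roughly what scalar SSE code does for float expressions.
float eval_float_steps(float c) {
    float q = c / 100.0f;    // rounded to float
    float d = 1.0f - q;      // rounded to float
    return 40.598053f * d;   // rounded to float
}

// Hypothetical helper: intermediates are kept in double and the
// result is rounded to float only once, at the very end.
// This stands in for wider x87-style evaluation (an assumption).
float eval_wide_steps(float c) {
    double q = static_cast<double>(c) / 100.0;
    double d = 1.0 - q;
    return static_cast<float>(40.598053f * d);
}

// Relative gap between the two evaluation strategies for input c.
float relative_gap(float c) {
    float a = eval_float_steps(c);
    float b = eval_wide_steps(c);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return scale == 0.0f ? 0.0f : std::fabs(a - b) / scale;
}
```

Calling relative_gap(1.4318620f) shows how far apart the two strategies end up for the constant in the post; any gap should be on the order of one unit in the last place of a float.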

Your two fValue2 values are identical to 7 significant figures, which is as identical as you can get for single precision floating point arithmetic. The solution to your problem is to use a proper relative error metric to determine when your output is incorrect. (see http://docs.sun.com/source/806-3568/ncg_goldberg.html for more info)

The root cause of the difference has already been mentioned in this thread: different rounding characteristics between x87 and SSE instructions.
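A relative error check of the kind suggested here might look like the following minimal sketch. The name nearly_equal and the default tolerance of 1e-6 (a few units in the last place for float) are assumptions chosen for illustration; the right tolerance depends on how much rounding error the computation accumulates.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true when a and b agree to within rel_tol
// relative to the larger magnitude, or to within abs_tol absolutely.
// abs_tol matters only for values very close to zero.
bool nearly_equal(float a, float b,
                  float rel_tol = 1e-6f, float abs_tol = 0.0f) {
    assert(rel_tol >= 0.0f && abs_tol >= 0.0f);
    if (a == b) {
        return true;  // also covers the case where both are zero
    }
    float diff = std::fabs(a - b);
    float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(rel_tol * scale, abs_tol);
}
```

With this check the two fValue2 results from the post (40.016743 and 40.016747) compare as equal, while results that differ by tens of percent do not.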

There are compiler specific ways to control the floating point code generated for the CPU. I have no experience with MSVC, so I can’t suggest anything but some quality time with Google.

As I understand it, there is no way (compiler options, for example) to guarantee that the results match? In my case the final result of a complex calculation may differ by up to 50% between the 32-bit and 64-bit builds of the application.

Well, those 50% deviations must be coming from some other bit of code than the one you listed above. One troubleshooting strategy you might employ is to output intermediate values during the computation and determine the point at which the values start to diverge significantly. This will help identify the root cause.
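The divergence hunt described above could be automated along these lines. This is only a sketch: it assumes both builds record the same intermediate values in the same order, and the name first_divergence and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the index of the first intermediate
// value at which the two recorded traces differ by more than rel_tol
// (relative to the larger magnitude), or -1 if they never diverge.
std::ptrdiff_t first_divergence(const std::vector<float>& run32,
                                const std::vector<float>& run64,
                                float rel_tol) {
    assert(run32.size() == run64.size());
    for (std::size_t i = 0; i < run32.size(); ++i) {
        float diff = std::fabs(run32[i] - run64[i]);
        float scale = std::max(std::fabs(run32[i]), std::fabs(run64[i]));
        if (diff > rel_tol * scale) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

A tolerance such as 1e-5 skips ordinary last-digit rounding noise and flags only the step where the two builds start to disagree significantly.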

Another idea that comes to mind is that kernels compiled for 64-bit sometimes use additional registers. The block size you have chosen may then exceed the hardware's register resources, in which case the kernel launch fails and the output buffer is never written. To identify this situation, add an option to your application that performs the following error check after every single kernel call (optional, as it reduces performance).

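cudaThreadSynchronize();  // wait for the kernel to finish (cudaDeviceSynchronize() in newer CUDA releases)

cudaError_t err = cudaGetLastError();

if (err != cudaSuccess) fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));  // needs <cstdio>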

You can ask your compiler to generate SSE instructions in 32-bit mode too (/arch:SSE on VC++).

(This is advisable in any case, unless you really need your app to run on pre-Pentium 3 or pre-Athlon XP CPUs…)
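To check which evaluation mode a given build actually uses, one option is to inspect the standard FLT_EVAL_METHOD macro from <cfloat>. The sketch below assumes the toolchain defines the macro and sets it to match the code it generates, which is worth verifying for the MSVC version in use; the function name is hypothetical.

```cpp
#include <cfloat>
#include <string>

// Hypothetical helper: reports how the compiler says it evaluates
// floating-point expressions, according to FLT_EVAL_METHOD.
std::string describe_float_eval() {
#ifdef FLT_EVAL_METHOD
    switch (FLT_EVAL_METHOD) {
        case 0:
            return "0: operations evaluated in the precision of their type";
        case 1:
            return "1: float and double operations evaluated as double";
        case 2:
            return "2: all operations evaluated as long double";
        default:
            return "indeterminate or implementation-defined";
    }
#else
    return "FLT_EVAL_METHOD is not defined by this toolchain";
#endif
}
```

Printing this string from both the 32-bit and the 64-bit build gives a quick hint as to whether the two builds round intermediates the same way.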

Already tried it - neither SSE nor SSE2 changes the situation in 32-bit code.