 # Square root problems It will not calculate correctly.

Hello,

I am trying to run the equation…

``````sqrt(

(v1.x-v2.x)^2 +

(v1.y-v2.x)^2 +

(v1.z-v2.z)^2

)
``````

and I know my code SHOULD work because I already had a working version of this code for an MPI program and have just reworked it to be a CUDA application, further I have also narrowed it down to that function because I removed that function from both the MPI and CUDA applications and the results are the same until I have my square root functions executed.

The only difference is that in my CUDA application it seemed to be easier to have my list of vectors be one big array rather than a multidimensional array merrily because it was easier, so it seemed when I started, to send a one dimensional array to the device than a two dimensional array. Anyway I have also checked to make sure that the calling of the individual members of the array are the same … which they are so it only leaves one function, the square root function.

I did a little research and found the following site… http://forums.nvidia.com/index.php?showtopic=50582 which told me to use the reciprocal square root function multiplied by the interior of the square root (rsqrt(x) * x) instead of the regular square root function. I have tried both ways, and both ways end up with incorrect results. Does anyone have any other thoughts?

How does your actual code doing the calculation look and what is wrong with the answer it gives?

Probably a silly question, but you do know that ^ is a logical xor-operator instead of a pow(), right? :)

N.

Thankfully, nvcc won’t let you XOR anything with a float. :)

That’s nice Anyway, I believe it should be

sqrt(

(v1.x-v2.x)^2 +

(v1.y-v2.y)^2 +

(v1.z-v2.z)^2

)

N.

Sorry I wasn’t clear I meant that as the general equations with the ^2 being the raised to second power. but the program code looks exactly the same except I did it like…

``````sqrt(

(v1.x-v2.x)*(v1.x-v2.x) +

(v1.y-v2.y)*(v1.y-v2.y) +

(v1.z-v2.z)*(v1.z-v2.z)

)
``````

And Nico I did mean for it to be (v1.y-v2.y)^2

Ok, can you show your incorrect results compared to the expected result?

Try replacing sqrt with __fsqrt_rn . It will be much slower (until it’s implemented in hardware), but should give you the correct rounding. (x * rsqrt(x) should be slightly more accurate than sqrt(x) when x is a normal number, but still not correctly rounded.)

By the way, why do you need the results of your MPI and CUDA versions to match exactly? (not a rhetoric question, I’m actually interested in hearing the answer…)

seibert I can show a partial output I would show the full output but I don’t think you want to look through 400 lines :)

Here are the correct results…

``````0.250,2138.611762554759

0.255,2092.868567279131

0.260,2058.309226301871

0.265,2016.717948844871

0.270,1955.39850723784

0.275,1869.225747429558

0.280,1760.6376794998

0.285,1637.854444507106

0.290,1511.987133687082

.

.

.

0.960,92.78303227132187

0.965,93.91643349781558

0.970,95.70534037574494

0.975,97.71544292934036

0.980,99.57838484108686

0.985,100.9984177869715

0.990,101.752963146927

0.995,101.6896222963883

1.000,100.7226676580754

1.005,98.83100363363878

.

.

.

1.200,114.0212254126888

1.205,111.2882085894538

1.210,108.9470922716362

1.215,107.0689548403182

1.220,105.6783678809785

1.225,104.7472445272642

1.230,104.1876497380735

1.235,103.8479534887385

1.240,103.5182667650906

1.245,102.9498676679108
``````

Here are the incorrect results for that same domain…

``````0.250,86288.1562499999854481

0.255,80877.5937499999854481

0.260,75868.3437499999854481

0.265,71216.4687499999854481

0.270,66883.4062499999854481

0.275,62835.7656249999927240

0.280,59044.5156249999927240

0.285,55484.5156249999927240

0.290,52133.9843749999927240

.

.

.

0.960,-760.0603027343748863

0.965,-747.7209472656248863

0.970,-736.4636230468748863

0.975,-726.1794433593748863

0.980,-716.6721191406248863

0.985,-707.6682128906248863

0.990,-698.8312988281248863

0.995,-689.7785644531248863

1.000,-680.1022949218748863

1.005,-669.3859863281248863

.

.

.

1.200,313.1156005859374432

1.205,301.6966552734374432

1.210,288.4442138671874432

1.215,273.6068115234374432

1.220,257.4373779296874432

1.225,240.1913452148437216

1.230,222.1231079101562216

1.235,203.4838256835937216

1.240,184.5186157226562216

1.245,165.4644165039062216
``````

So as you can see there is a major difference in the two…

The graph of the correct results should look like a roller-coaster and the results should never be negative.

Also Sylvain Collange I believe this should answer why I am looking for them to be the same :) its not that I need them to be exactly the same I mean a 10e-8 error would be fine by me but the numbers aren’t even close you know. Also It comes down to this is my first CUDA application and since I already had it written for MPI turning it into a CUDA application seemed like a simple way to start grasping some of the concepts even though I should probably be completely redesign it to a more CUDA like style program design but I think you get what I was trying to do, fast and simple to understand how CUDA works before I start diving into a program where I don’t know what the results should be.

Also I just tried using __fsqrt_rn and still that answer does not come out correctly…I feel as though this cannot be correct…

This is the code that works…

``````double p[ATOM_LENGTH];

.

.

.

double *p1=p[i];

double *p2=p[j];

.

.

.

216  double sq=(

217	(

218	  (p1-p2) *

219	  (p1-p2)

220	) +

221	(

222	  (p1-p2) *

223	  (p1-p2)

224	) +

225	(

226	  (p1-p2) *

227	  (p1-p2)

228	)

229  );

230  return sqrt(sq);
``````

All of these return the incorrect result…

``````double p[ATOM_LENGTH *3];

.

.

.

74   double sq=(

75	 (

76	   (p[ATOM_LENGTH * 0 + i]-p[ATOM_LENGTH * 0 + j]) *

77	   (p[ATOM_LENGTH * 0 + i]-p[ATOM_LENGTH * 0 + j])

78	 ) +

79	 (

80	   (p[ATOM_LENGTH * 1 + j]-p[ATOM_LENGTH * 1 + j]) *

81	   (p[ATOM_LENGTH * 1 + i]-p[ATOM_LENGTH * 1 + j])

82	 ) +

83	 (

84	   (p[ATOM_LENGTH * 2 + i]-p[ATOM_LENGTH * 2 + j]) *

85	   (p[ATOM_LENGTH * 2 + i]-p[ATOM_LENGTH * 2 + j])

86	 )

87   );

88   return __fsqrt_rn(sq);

89   // return sqrt(sq);

90   // return rsqrt(sq) * sq;
``````

Sorry for the length of this post… and for all the numbers in my code just copied it straight from vim…if you guys have any thoughts please let me know and if I am doing something really stupid please let me down easy :)

Thanks.

This doesn’t look right:

double p[ATOM_LENGTH *3];

.

.

.

74 double sq=(

75 (

76 (p[ATOM_LENGTH * 0 + i]-p[ATOM_LENGTH * 0 + j]) *

77 (p[ATOM_LENGTH * 0 + i]-p[ATOM_LENGTH * 0 + j])

78 ) +

79 (

80 (p[ATOM_LENGTH * 1 + j]-p[ATOM_LENGTH * 1 + j]) *

81 (p[ATOM_LENGTH * 1 + i]-p[ATOM_LENGTH * 1 + j])

82 ) +

83 (

84 (p[ATOM_LENGTH * 2 + i]-p[ATOM_LENGTH * 2 + j]) *

85 (p[ATOM_LENGTH * 2 + i]-p[ATOM_LENGTH * 2 + j])

86 )

87 );

88 return __fsqrt_rn(sq);

89 // return sqrt(sq);

90 // return rsqrt(sq) * sq;

N.

Oh, and you are using double-precision.
Forget about what I said, then. The double-precision sqrt is correctly rounded on the GPU.
(But dot products can behave slightly differently than on the CPU because of the FMA. Nothing that should cause such large differences, anyway.)

I get the same results with floating point as well though…(and I am using the -arch sm_13)

I feel like and idiot now…I won’t be able to check that for a little bit…Thanks for looking through it I’ll check that when I get back to my office.

Did you correct the error I indicated in my previous post? You put a ‘j’ where it was supposed to be an ‘i’?

N.

Nico as it turns out … I guess I cannot distinguish the difference between what an i and a j look like. You know how it is you have been looking at something for so long you don’t see the simple mistakes … thanks for pointing it out for me. Everything is working properly now. I can’t believe the difference in times from my MPI version to the CUDA … for a data set of 1000 atom coordinates it takes the MPI version 40 seconds and the CUDA 2.3 to complete. I mean I am only running it on a single GTX 285 and the CPU is a quad-core Intel 6600 that I am doing the MPI calculation on. I am already impressed by this CUDA setup and I haven’t even tried threading it to get both of my GTX 285’s hooked up. Cheers, thanks for the help.

Glad to hear you got it working then, it’s usually the smallest bug that takes you the longest time to track down :)

I’ve been impressed by the speed at which the GPU can process data from before the advent of CUDA, when I was using OpenGL for my gpgpu programs, but CUDA makes it even more fun.

N.