Square root problems It will not calculate correctly.

blithist · July 2, 2009, 7:20pm

Hello,

I am trying to run the equation…

sqrt(

  (v1.x-v2.x)^2 +

  (v1.y-v2.x)^2 +

  (v1.z-v2.z)^2

)

and I know my code SHOULD work because I already had a working version of this code for an MPI program and have just reworked it to be a CUDA application, further I have also narrowed it down to that function because I removed that function from both the MPI and CUDA applications and the results are the same until I have my square root functions executed.

The only difference is that in my CUDA application it seemed to be easier to have my list of vectors be one big array rather than a multidimensional array merrily because it was easier, so it seemed when I started, to send a one dimensional array to the device than a two dimensional array. Anyway I have also checked to make sure that the calling of the individual members of the array are the same … which they are so it only leaves one function, the square root function.

I did a little research and found the following site… http://forums.nvidia.com/index.php?showtopic=50582 which told me to use the reciprocal square root function multiplied by the interior of the square root (rsqrt(x) * x) instead of the regular square root function. I have tried both ways, and both ways end up with incorrect results. Does anyone have any other thoughts?

avidday · July 2, 2009, 8:19pm

How does your actual code doing the calculation look and what is wrong with the answer it gives?

Nico · July 2, 2009, 8:58pm

Probably a silly question, but you do know that ^ is a logical xor-operator instead of a pow(), right? :)

N.

seibert · July 2, 2009, 9:15pm

Thankfully, nvcc won’t let you XOR anything with a float. :)

Nico · July 2, 2009, 9:30pm

That’s nice External Media Anyway, I believe it should be

sqrt(

(v1.x-v2.x)^2 +

(v1.y-v2.y)^2 +

(v1.z-v2.z)^2

)

N.

blithist · July 2, 2009, 10:52pm

Sorry I wasn’t clear I meant that as the general equations with the ^2 being the raised to second power. but the program code looks exactly the same except I did it like…

sqrt(

  (v1.x-v2.x)*(v1.x-v2.x) +

  (v1.y-v2.y)*(v1.y-v2.y) +

  (v1.z-v2.z)*(v1.z-v2.z)

)

And Nico I did mean for it to be (v1.y-v2.y)^2

seibert · July 2, 2009, 11:56pm

Ok, can you show your incorrect results compared to the expected result?

Sylvain_Collange · July 3, 2009, 11:39am

Try replacing sqrt with __fsqrt_rn . It will be much slower (until it’s implemented in hardware), but should give you the correct rounding. (x * rsqrt(x) should be slightly more accurate than sqrt(x) when x is a normal number, but still not correctly rounded.)

By the way, why do you need the results of your MPI and CUDA versions to match exactly? (not a rhetoric question, I’m actually interested in hearing the answer…)

blithist · July 3, 2009, 4:56pm

seibert I can show a partial output I would show the full output but I don’t think you want to look through 400 lines :)

Here are the correct results…

0.250,2138.611762554759

0.255,2092.868567279131

0.260,2058.309226301871

0.265,2016.717948844871

0.270,1955.39850723784

0.275,1869.225747429558

0.280,1760.6376794998

0.285,1637.854444507106

0.290,1511.987133687082

.

.

.

0.960,92.78303227132187

0.965,93.91643349781558

0.970,95.70534037574494

0.975,97.71544292934036

0.980,99.57838484108686

0.985,100.9984177869715

0.990,101.752963146927

0.995,101.6896222963883

1.000,100.7226676580754

1.005,98.83100363363878

.

.

.

1.200,114.0212254126888

1.205,111.2882085894538

1.210,108.9470922716362

1.215,107.0689548403182

1.220,105.6783678809785

1.225,104.7472445272642

1.230,104.1876497380735

1.235,103.8479534887385

1.240,103.5182667650906

1.245,102.9498676679108

Here are the incorrect results for that same domain…

0.250,86288.1562499999854481

0.255,80877.5937499999854481

0.260,75868.3437499999854481

0.265,71216.4687499999854481

0.270,66883.4062499999854481

0.275,62835.7656249999927240

0.280,59044.5156249999927240

0.285,55484.5156249999927240

0.290,52133.9843749999927240

.

.

.

0.960,-760.0603027343748863

0.965,-747.7209472656248863

0.970,-736.4636230468748863

0.975,-726.1794433593748863

0.980,-716.6721191406248863

0.985,-707.6682128906248863

0.990,-698.8312988281248863

0.995,-689.7785644531248863

1.000,-680.1022949218748863

1.005,-669.3859863281248863

.

.

.

1.200,313.1156005859374432

1.205,301.6966552734374432

1.210,288.4442138671874432

1.215,273.6068115234374432

1.220,257.4373779296874432

1.225,240.1913452148437216

1.230,222.1231079101562216

1.235,203.4838256835937216

1.240,184.5186157226562216

1.245,165.4644165039062216

So as you can see there is a major difference in the two…

The graph of the correct results should look like a roller-coaster and the results should never be negative.

Also Sylvain Collange I believe this should answer why I am looking for them to be the same :) its not that I need them to be exactly the same I mean a 10e-8 error would be fine by me but the numbers aren’t even close you know. Also It comes down to this is my first CUDA application and since I already had it written for MPI turning it into a CUDA application seemed like a simple way to start grasping some of the concepts even though I should probably be completely redesign it to a more CUDA like style program design but I think you get what I was trying to do, fast and simple to understand how CUDA works before I start diving into a program where I don’t know what the results should be.

Also I just tried using __fsqrt_rn and still that answer does not come out correctly…I feel as though this cannot be correct…

This is the code that works…

double p[ATOM_LENGTH][3];

.

.

.

double *p1=p[i];

double *p2=p[j];

.

.

.

216  double sq=(

217	(

218	  (p1[0]-p2[0]) *

219	  (p1[0]-p2[0])

220	) +

221	(

222	  (p1[1]-p2[1]) *

223	  (p1[1]-p2[1])

224	) +

225	(

226	  (p1[2]-p2[2]) *

227	  (p1[2]-p2[2])

228	)

229  );

230  return sqrt(sq);

All of these return the incorrect result…

double p[ATOM_LENGTH *3];

.

.

.

 74   double sq=(

 75	 (

 76	   (p[ATOM_LENGTH * 0 + i]-p[ATOM_LENGTH * 0 + j]) *

 77	   (p[ATOM_LENGTH * 0 + i]-p[ATOM_LENGTH * 0 + j])

 78	 ) +

 79	 (

 80	   (p[ATOM_LENGTH * 1 + j]-p[ATOM_LENGTH * 1 + j]) *

 81	   (p[ATOM_LENGTH * 1 + i]-p[ATOM_LENGTH * 1 + j])

 82	 ) +

 83	 (

 84	   (p[ATOM_LENGTH * 2 + i]-p[ATOM_LENGTH * 2 + j]) *

 85	   (p[ATOM_LENGTH * 2 + i]-p[ATOM_LENGTH * 2 + j])

 86	 )

 87   );

 88   return __fsqrt_rn(sq);

 89   // return sqrt(sq);

 90   // return rsqrt(sq) * sq;

Sorry for the length of this post… and for all the numbers in my code just copied it straight from vim…if you guys have any thoughts please let me know and if I am doing something really stupid please let me down easy :)

Thanks.

Nico · July 3, 2009, 5:11pm

This doesn’t look right:

double p[ATOM_LENGTH *3];

.

74 double sq=(

75 (

76 (p[ATOM_LENGTH * 0 + i]-p[ATOM_LENGTH * 0 + j]) *

77 (p[ATOM_LENGTH * 0 + i]-p[ATOM_LENGTH * 0 + j])

78 ) +

79 (

80 (p[ATOM_LENGTH * 1 + j]-p[ATOM_LENGTH * 1 + j]) *

81 (p[ATOM_LENGTH * 1 + i]-p[ATOM_LENGTH * 1 + j])

82 ) +

83 (

84 (p[ATOM_LENGTH * 2 + i]-p[ATOM_LENGTH * 2 + j]) *

85 (p[ATOM_LENGTH * 2 + i]-p[ATOM_LENGTH * 2 + j])

86 )

87 );

88 return __fsqrt_rn(sq);

89 // return sqrt(sq);

90 // return rsqrt(sq) * sq;

N.

Sylvain_Collange · July 3, 2009, 5:38pm

Oh, and you are using double-precision.
Forget about what I said, then. The double-precision sqrt is correctly rounded on the GPU.
(But dot products can behave slightly differently than on the CPU because of the FMA. Nothing that should cause such large differences, anyway.)

blithist · July 3, 2009, 5:51pm

I get the same results with floating point as well though…(and I am using the -arch sm_13)

blithist · July 3, 2009, 5:59pm

I feel like and idiot now…I won’t be able to check that for a little bit…Thanks for looking through it I’ll check that when I get back to my office.

Nico · July 3, 2009, 6:00pm

Did you correct the error I indicated in my previous post? You put a ‘j’ where it was supposed to be an ‘i’?

N.

blithist · July 3, 2009, 6:50pm

Nico as it turns out … I guess I cannot distinguish the difference between what an i and a j look like. You know how it is you have been looking at something for so long you don’t see the simple mistakes … thanks for pointing it out for me. Everything is working properly now. I can’t believe the difference in times from my MPI version to the CUDA … for a data set of 1000 atom coordinates it takes the MPI version 40 seconds and the CUDA 2.3 to complete. I mean I am only running it on a single GTX 285 and the CPU is a quad-core Intel 6600 that I am doing the MPI calculation on. I am already impressed by this CUDA setup and I haven’t even tried threading it to get both of my GTX 285’s hooked up. Cheers, thanks for the help.

Nico · July 3, 2009, 8:31pm

Glad to hear you got it working then, it’s usually the smallest bug that takes you the longest time to track down :)

I’ve been impressed by the speed at which the GPU can process data from before the advent of CUDA, when I was using OpenGL for my gpgpu programs, but CUDA makes it even more fun.

N.

Topic		Replies	Views
CUDA innacuracy? CUDA float produces different result from CPU float CUDA Programming and Performance	8	3066	September 9, 2011
Cuda fortran - sqrt() has different results when being called in a device code nvc, nvc++ and nvfortran	8	375	February 15, 2024
Correctly rounded rsqrt in double precision? CUDA Programming and Performance	10	2835	October 15, 2013
Accuracy in GPU floating point calculations CUDA Programming and Performance	35	8293	September 9, 2011
Unable to get correct square root results in kernel Legacy PGI Compilers	2	1559	April 5, 2018
Unacceptably low accuracy of calculations on the GPU CUDA Programming and Performance cuda	12	2049	May 15, 2024
Performance tweak for single-precision square root CUDA Programming and Performance	0	991	March 25, 2021
Cuda math produces a slightly different result to C++ math CUDA Programming and Performance	11	1511	August 31, 2022
Huge instruction stream for reciprocal on CC 2.0 reciprocal operation side effect? CUDA Programming and Performance	8	5453	September 13, 2011
Problems of matrix multiplication With and without CUDA CUDA Programming and Performance	15	10033	January 18, 2012

Square root problems It will not calculate correctly.

Related topics