trouble with "CUDA by Example" "dot" program

New to CUDA. Am trying to learn from “CUDA by Example” (Sanders and Kandrot). The dot product program they have in their chapter 5 is giving incorrect results on my platform. If N (the number of elements going into the product) is sufficiently large (more than 3594, for some reason), I can always find some number for “threadsPerBlock” and/or “blocksPerGrid” that make the GPU sum different from the result returned by the “sum_squares” function. I am staying well below the hardware limits for threads and blocks, so I don’t know what the problem could be. BTW, the difference between the two numbers is always some fairly large power of two. I get this problem in the exact downloaded code, no editing on my part. Any ideas what is going on? Thanks.

“no editing on my part”

“If N (the number of elements going into the product) is sufficiently large (more than 3594, for some reason), I can always find some number for “threadsPerBlock” and/or “blocksPerGrid” that make the GPU sum different from the result returned by the “sum_squares” function”

Aren’t those two excerpts contradictory?

I think it would be useful if you provided the actual code changes you did make, for which errors were observed.

The code is designed so that you can vary N without needing to modify the “threadsPerBlock” or the “blocksPerGrid” parameter, and it will still work correctly.

But you can certainly break the code by choosing invalid settings for these parameters.

For example if you choose a non-power-of-2 number for threadsPerBlock, the code is expected to break.

The particular parallel reduction used in that code expects a power of 2 number of threads per block, and in fact is commented to that effect in the code:

// for reductions, threadsPerBlock must be a power of 2
// because of the following code

If a power-of-two choice for threadblock size still produces an error, could you provide the exact changes you made to the code? It makes it much easier for those trying to help you.

I tried to make it clear, but maybe I didn’t: I get this error when I run the code straight out of the box, no changes at all. If I make N small enough, I can fiddle with some numbers and get rid of the error. Or I can fiddle some more and bring it back. Thanks.

One thing: There is one edit you have to make to see the problem. You have to change the output format and add space for at least 14 digits. The supplied output format doesn’t produce enough digits to show the problem.

OK, I don’t see that. I grabbed the code from chapter 5 here:

I modified it slightly to remove the dependencies on the boilerplate in book.h (I don’t think the changes I made matter.)

Then I compiled and ran it on RHEL 5.5/CUDA 6.5RC/Quadro5000. The session is here:

Can you go over the specifics of your platform (OS, GPU, CUDA version) and compile command?
What numerical results do you get for the unmodified code?

Perhaps try the code I posted in the link above.

“I grabbed the code from …”
I got it from the same place.

"I modified it slightly to remove the dependencies … "
I have now done the same. My organization’s firewall won’t let me get to pastebin. I trust we are running codes that are alike in all relevant respects.

“What numerical results do you get for the unmodified code?”
I changed the output statement to:

printf( "Does GPU value %20.3f = %20.3f?\n", c,
             2 * sum_squares( (float)(N - 1) ) );

and the output given was:

Does GPU value   25723561574400.000 =   25723565768704.000?

BTW, this is absolutely unchanged from the result I get running the completely “raw” code. And you might notice, for what it’s worth, that the difference between the two numbers is 4194304, which is exactly 2^22.

“specifics of your platform … and compile command”
I compiled with nvcc, no flags. I am running on CentOS 5.10. GPU is a Tesla C2070, compute capability 2.0.

Thanks for working with me on this. I appreciate any ideas you might have.

Whoops. Forgot. CUDA version is 5.0.

So you did change the code. I’ll bet the printout from the unmodified program is identical for you between host and GPU. It is for me. (and the first 7 significant digits in your case match, so the ordinary float printout would probably match)

The discrepancy in the underlying data in the program is a limitation of the float data type. float has a limited number of mantissa bits. The largest range over which a float can represent every integer exactly is up to 2^24 = 16,777,216 (23 stored mantissa bits plus the implicit leading bit). Above that you are dealing with the potential for inexact storage, and round-off effects. This occurs not only on an absolute scale, but also at each intermediate addition or multiplication operation.

I don’t want to go into a dissertation on it here. Research floating point in more detail, or read this paper:

and you’ll get some idea of what may be happening. Change the underlying data type in the code to double (53 mantissa bits) or unsigned long long (64 “mantissa” bits), and I think you’ll see exact results between the two cases.

I forgot about the limited precision. So that clears it up. Many thanks.

In my own defense, I did say several comments back that I made the change to the output.

I was being goofy forgetting about the precision issue. But I have to say I wasn’t thinking about such fundamental things because I was naturally trusting the authors here. To me, this means that this code with “N” set to its original value of 33*1024 is a bad example. Or at least it should come with some kind of heavily emphasized qualifier in the book.

In any event, problem solved. Thanks for your time and help.