Hi,
The following kernel seems to run into trouble on the Tesla T10 GPU:
[codebox]
/*--------------------------------------*/
#define COUNT 256
__global__ void findSum(numType *cu_result)
{
    numType temp = 0;
    for (int i = 0; i < COUNT; i++)
    {
        temp += 80189;
    }
    *cu_result = (temp / COUNT);
    return;
}
/*-----------------------------*/[/codebox]
When numType is “float”, output is: 80188.8
When numType is “double”, output is: -2.22507e-308
In both cases I call the kernel in the following manner:
[codebox]
numType *cu_result;
cudaMalloc((void**)&cu_result, sizeof(numType));
findSum<<<1,1>>>(cu_result);
[/codebox]
So there is just 1 block with 1 thread.
Am I doing something wrong?
Best,
Raghavan
I’ll bet you’re not compiling with the -arch sm_13 flag to enable double precision.
Indeed I was not! This takes care of the case where “numType” is “double”.
But the single precision problem still exists. I guess this really is a bug
in the hardware??
This seems to be a single-precision issue rather than a CUDA-specific
issue. Never mind…
Nope, your Tesla (and your floats) are working just like they should be.
A float can’t represent every integer above 2^24 exactly, so your running sum accumulates rounding losses.
If you do the same accumulation loop on the CPU, you’ll get the identical 80188.8 result as the Tesla.
Floating point seems so simple, but it’s sometimes fiendish! A long but useful guide:
http://citeseerx.ist.psu.edu/viewdoc/summa…=10.1.1.22.6768
You can almost summarize it by saying “be careful adding numbers of different magnitudes, and don’t trust the difference between two numbers of the same magnitude.”
In this case, you’re doing the former by adding values like 80189.0 to numbers that are two orders of magnitude larger, like 20528330.0.