I’m trying to replicate some Matlab code in CUDA in order to increase performance. I am using a Tesla C2050 card on Windows 7 64-bit. The Matlab code does some simple calculations of frequency steps and simulation time, and then computes the exponential function of a complex array. The problem I am getting is in the calculation of the adjusted simulation time. I have been able to trace the differences down to one calculation of the offset for the simulation time. This is simply 2 * range / speed_of_light, where range = 4.332077681840050e4 and speed_of_light = 3e8; both of these values are passed in to the CUDA routine. When I pass out the result of this calculation I get a difference in the values. Can anyone explain why this occurs and how I can fix it? Are there any compiler options that I need to be aware of for higher precision?
CUDA result:     2.888051967602223e-004
Matlab result:   2.888051787893366e-004
Hand-calculated: 2.888051787e-004
It looks very much like you are actually using floats rather than doubles to compute your figures.
Just for the sake of it, here is what your formula gives me on the CPU using either doubles or floats:

double: 0.0002888051787893367
float:  0.0002888051967602223
Since I take from the title of your post that your code is actually using doubles, I guess you simply compiled it without specifying a target architecture. By default you are then compiling for compute capability 1.0, which doesn’t support double-precision floating-point arithmetic, so all your doubles are silently downcast to floats. Since your target hardware is a C2050, adding -arch=sm_20 to your compiler options should solve your problem (hopefully) — though since I’m using Linux, I’m not sure how to add this on Windows, or even whether the option is the same there.
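For reference, here is a minimal sketch of the device-side computation (the kernel name, launch configuration, and host wrapper are mine, not from your code). Built with -arch=sm_20 (or sm_13 and above) the doubles are honoured; built for the default sm_10 target they would be demoted to floats:

```cuda
#include <cstdio>

// Hypothetical kernel: offset = 2 * range / c, in double precision.
// Double arithmetic requires compute capability >= 1.3;
// compile with -arch=sm_20 for a Tesla C2050.
__global__ void computeOffset(const double *range, double *offset, double c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    offset[i] = 2.0 * range[i] / c;
}

int main()
{
    double h_range = 4.332077681840050e4, h_offset = 0.0;
    double *d_range, *d_offset;
    cudaMalloc((void **)&d_range, sizeof(double));
    cudaMalloc((void **)&d_offset, sizeof(double));
    cudaMemcpy(d_range, &h_range, sizeof(double), cudaMemcpyHostToDevice);

    computeOffset<<<1, 1>>>(d_range, d_offset, 3e8);

    cudaMemcpy(&h_offset, d_offset, sizeof(double), cudaMemcpyDeviceToHost);
    printf("%.16e\n", h_offset);

    cudaFree(d_range);
    cudaFree(d_offset);
    return 0;
}
```

With -arch=sm_20 this should print the same value as the Matlab/double CPU result.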
HTH
Tesla C2050 is a Fermi device (don’t get confused by the irritating naming - Tesla is the name of both a GPU architecture and a product line). Compiling with nvcc -ptx -arch=sm_13 .cu works even though the architecture is wrong because it generates PTX code, which can be translated for the Fermi architecture at runtime.
If your card really is a Tesla C2050, using sm_20 should work as well (or better…).