2 Small Questions

  1. According to the Programming Guide, the double type gets demoted to float when code is compiled for a device with compute capability < 1.3. In emulation mode the code is compiled to run on the host, so does double stay double there?

  2. From what I could gather from scattered sources, each Streaming Multiprocessor contains 1 double-precision unit, making it 30 for the GTX 280. The unit is capable of issuing 1 double-precision instruction per clock cycle (correct me if I am wrong, as I am very uncertain about this).

What I am wondering is: What happens in a kernel containing


a = b+c in double precision

when the 8 scalar processors reach the instruction that adds the two double-precision operands?

I’m not sure about the first question.
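I can't confirm the emulation-mode behaviour, but the effect of the demotion itself is easy to see on the host. Below is a minimal sketch in plain Python (not CUDA), where `demote_to_float` is a hypothetical helper that rounds a double to the nearest single-precision value, roughly what happens to every double when the compiler demotes it:

```python
import struct


def demote_to_float(x: float) -> float:
    """Round an IEEE 754 double to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


b = 1.0
c = 1e-8

# a = b + c kept in double precision: the small term survives.
a_double = b + c

# Same sum with operands and result demoted to float, as an illustration
# of what the compiler's demotion does to the arithmetic.
a_float = demote_to_float(demote_to_float(b) + demote_to_float(c))

print("double:", a_double)
print("float: ", a_float)
```

In double the sum differs from 1.0, while in float the small term is rounded away completely, so a kernel that silently falls back to float can give visibly different results.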

For your second question, you are correct that there is only 1 double-precision floating-point unit per SM, bringing the total count to 30 for the GTX 280. While this may seem unfortunate, it does make sense, since these units take up a lot of die space. When you have 8 threads that all need the double-precision unit, I imagine the operations will be serialized, which may cause a performance hit. Hopefully you have enough threads and other operations in flight that the cost of serialization can be hidden. It might be worth taking a little time to establish what really needs to be 64-bit and what can get by in 32-bit.
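To put a rough number on that serialization guess, here is a back-of-the-envelope sketch. It assumes a warp of 32 threads, that each unit accepts one thread's operation per clock, and that serialization adds no extra overhead; the last two are assumptions from this thread, not confirmed hardware behaviour, and `cycles_per_warp_instruction` is just an illustrative helper:

```python
WARP_SIZE = 32        # threads per warp
SP_UNITS_PER_SM = 8   # scalar processors per SM
DP_UNITS_PER_SM = 1   # double-precision units per SM (as discussed above)


def cycles_per_warp_instruction(units_per_sm: int, warp_size: int = WARP_SIZE) -> int:
    """Clocks needed to push one instruction for a whole warp through the units.

    Assumes one operation per unit per clock and no serialization overhead.
    """
    return -(-warp_size // units_per_sm)  # ceiling division


sp_cycles = cycles_per_warp_instruction(SP_UNITS_PER_SM)
dp_cycles = cycles_per_warp_instruction(DP_UNITS_PER_SM)
slowdown = dp_cycles / sp_cycles

print("single precision:", sp_cycles, "cycles per warp instruction")
print("double precision:", dp_cycles, "cycles per warp instruction")
print("relative cost:", slowdown)
```

Under those assumptions a double-precision instruction occupies the SM 8 times longer than a single-precision one, which is the gap that other warps and non-double work would have to hide.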

I had the same gut feeling, but it would be nice to have it confirmed. Assuming a MAD instruction still takes 1 clock cycle in double precision, you get a peak of ~78 GFLOPS in double precision. If the operations do get serialized, hopefully that does not introduce any extra overhead. What's the going rate on CPUs? I believe the Cell managed around 20 GFLOPS sustained (from shaky memory).
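For reference, the ~78 figure comes out of the following arithmetic. This is only a sketch: the 1.296 GHz shader clock is my assumption for a stock GTX 280, and one MAD per clock per double-precision unit is the same unconfirmed assumption as above:

```python
NUM_SMS = 30               # streaming multiprocessors on the GTX 280
DP_UNITS_PER_SM = 1        # double-precision units per SM
FLOPS_PER_MAD = 2          # a multiply-add counts as two floating-point operations
SHADER_CLOCK_GHZ = 1.296   # assumed stock shader clock, not stated in this thread


def peak_gflops(num_sms: int, units_per_sm: int, flops_per_instr: int, clock_ghz: float) -> float:
    """Theoretical peak, assuming every unit retires one instruction per clock."""
    return num_sms * units_per_sm * flops_per_instr * clock_ghz


dp_peak = peak_gflops(NUM_SMS, DP_UNITS_PER_SM, FLOPS_PER_MAD, SHADER_CLOCK_GHZ)
print(f"peak double precision: {dp_peak:.2f} GFLOPS")
```

This is a theoretical ceiling only; anything other than back-to-back MADs, or any serialization overhead, pulls the achievable rate below it.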

Is there anyone with a compute capability 1.3 card who might know the answer(s)?