As far as I understand, each multiprocessor in the GTX 2xx series has 1 double precision FPU and 8 single precision FPUs. I cannot find the double precision floating point instruction throughput documented anywhere in the CUDA 2.3 documentation. I am aware of pages 77-79 in the programming guide and pages 43-45 in the best practices guide: these list the single precision throughput as 8 operations per clock cycle and state that compute capability 1.3 hardware supports native double precision calculations. The best practices guide alludes to the lower performance of double precision, but I cannot find any reference to the actual number of double precision floating point operations per clock cycle.
I have three questions:
Am I correct in assuming that double precision operations are still organised into 32-thread warps, and that one thread from each warp is processed per clock cycle? If that is the case, I fail to see how the architecture benefits from being SIMD in double precision.
Have I missed the place in the documentation that mentions the double precision floating point performance?
Is it possible to use both the 8 SP FPUs and the DP FPU in parallel, capitalising on the extra hardware included?
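For what it's worth, in the absence of an official figure I suppose the throughput could be estimated with a microbenchmark along these lines. This is only a sketch of my own: the launch configuration, iteration count and constants are arbitrary choices, not from any NVIDIA document.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 4096

__global__ void dp_mad(double *out, double a, double b)
{
    double x = a + threadIdx.x;
    // Long dependent chain of multiply-adds so the compiler cannot
    // eliminate the work; each iteration is 2 flops.
    for (int i = 0; i < ITERS; ++i)
        x = x * a + b;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the result live
}

int main()
{
    const int blocks = 30, threads = 256;  // guess: ~1 block per SM on a GTX 280
    double *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(double));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dp_mad<<<blocks, threads>>>(d_out, 1.000001, 0.5);  // warm-up launch
    cudaEventRecord(start);
    dp_mad<<<blocks, threads>>>(d_out, 1.000001, 0.5);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * ITERS * (double)blocks * threads;
    printf("~%.1f DP GFLOPS\n", flops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}
```

Dividing the measured GFLOPS by the shader clock and the number of multiprocessors should then give an estimate of DP operations per clock per multiprocessor, though of course a measurement is no substitute for a citable reference.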
At any rate, my application is memory bandwidth limited and as such would not benefit from any more double precision FPUs. In fact, the memory bandwidth and arithmetic throughput are better balanced in double precision than in single precision in my case. I am simply interested in learning more about the hardware. I have found this [topic="70015"]post[/topic] detailing the specifications; however, I would prefer to have some official reference for use in my thesis!
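To illustrate what I mean by "better balanced", here is my back-of-the-envelope arithmetic. The figures (30 multiprocessors, 1.296 GHz shader clock, 141.7 GB/s for a GTX 280, and 2 flops per MAD/FMA) are the unofficial specifications I have seen quoted, not numbers from NVIDIA documentation:

```cuda
#include <cstdio>

int main()
{
    // Assumed (unofficial) GTX 280 figures
    const double sms = 30, clock_hz = 1.296e9, bw_bytes = 141.7e9;

    const double sp_peak = sms * 8 * 2 * clock_hz;  // 8 SP FPUs, MAD = 2 flops
    const double dp_peak = sms * 1 * 2 * clock_hz;  // 1 DP FPU,  FMA = 2 flops

    // Arithmetic intensity needed to be compute bound rather than bandwidth bound
    printf("SP peak: %.0f GFLOPS, %.2f flops/byte to saturate\n",
           sp_peak / 1e9, sp_peak / bw_bytes);
    printf("DP peak: %.0f GFLOPS, %.2f flops/byte to saturate\n",
           dp_peak / 1e9, dp_peak / bw_bytes);
    return 0;
}
```

By this estimate a double precision code needs far fewer flops per byte of memory traffic to saturate the arithmetic units than a single precision one, which is why my bandwidth-limited kernel is closer to balanced in double precision.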
Thanks in advance…