Number of 64 bit floating point operations per clock cycle?

EVP · July 7, 2014, 11:40pm

Hi.
According to the table “Number of Operations per Clock Cycle per Multiprocessor” from the CUDA C Programming Guide, each multiprocessor of my GPU (GTS 450) can do 4 “64-bit floating-point add, multiply, multiply-add” operations per clock cycle. What does that really mean?
Does it mean that each multiprocessor has just 4 64-bit ALUs?
If I have a block with 32 threads doing double-precision floating-point operations, will only 4 execute in parallel while the others have to wait until the next clock cycle?
Thanks.

Robert_Crovella · July 8, 2014, 1:25am

There are two versions of the GTS 450: the GTS 450 has 4 SMs and the GTS 450 OEM has 3 SMs (a cc 2.1 SM has 48 CUDA “cores”):

http://www.techpowerup.com/gpudb/599/geforce-gts-450-oem.html

Yes, a cc 2.1 multiprocessor (SM) has 48 SP “units” and 4 DP “units” (you can call them ALUs if you want):

Programming Guide :: CUDA Toolkit Documentation

Yes, that means each SM is able to “retire” up to 4 DP FMA instructions per cycle. A full warp (32 threads) of DP FMA operations would therefore take at least 32 / 4 = 8 cycles to retire.

Looking at the entire chip, there are either 3 or 4 SMs, so it could retire up to 12 or 16 DP FMA instructions per cycle.

In general, with the exception of the Titan line, GeForce products are not principally designed to deliver double precision floating point performance. Ordinary DX or OGL graphics has no use for double-precision floating point. Furthermore, GTS 450 is a fairly low-end GPU in the previous “Fermi” generation of GPUs. “Kepler” GPUs (cc 3.0/3.5) have been shipping for about 2 years now. “Maxwell” GPUs (cc 5.0) just started shipping a few months ago.

DP = double precision = 64-bit = “double”
SP = single precision = 32-bit = “float”

EVP · July 8, 2014, 2:02am

Thanks!
I have some code that runs slower on the GPU than on the CPU for many reasons, like lots of conditional branches, dynamic memory allocation and double precision floating point operations, and I just want to explain why it’s so slow and why running it on the GPU is not a good idea.
