# of multiprocessors - still more silly stuff to ask

I always thought the G80 was 8 (6) multiprocessors of 16-wide SIMD at 1.35 (1.2) GHz.

The CUDA programming guide says:

So is there 8 × 16 KB or 16 × 16 KB of shared RAM?

Greetings

     Knax

On the 8800 GTX there are sixteen 16 KB shared memory regions, one per multiprocessor.

On the 8800 GTS there are 12.
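
If you want to check these numbers on your own card, here is a minimal sketch using the CUDA runtime API's cudaGetDeviceProperties (the multiProcessorCount field may not be present in the very earliest CUDA releases):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* On an 8800 GTX this should report 16 multiprocessors, on an
           8800 GTS 12; sharedMemPerBlock is the 16 KB (16384-byte)
           shared memory of one multiprocessor. */
        printf("Device %d: %s\n", dev, prop.name);
        printf("  multiprocessors:   %d\n", prop.multiProcessorCount);
        printf("  shared mem per MP: %lu bytes\n",
               (unsigned long)prop.sharedMemPerBlock);
    }
    return 0;
}
```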

Mark

On page 49 of the programming guide it says that 2 clock cycles are needed to process 32 threads on the 8 processors per multiprocessor. This makes sense if the clock frequency is twice as high internally.

However, on page 51 it says that 2 clock cycles are needed for a floating-point operation. Is this at 650 MHz or at the doubled frequency? At 650 MHz I can't understand how the observed 350 GFLOPS on page 1 is calculated.

It takes 2 clock cycles for a float operation per warp of 32 threads. All clock cycles referred to in the programming guide refer to the 675 MHz instruction clock (on GeForce 8800 GTX). Yes, the "actual" clock in the hardware is 2X that rate (as advertised) and instructions are multi-pumped.

Because instruction decode operates on 32-thread warps, it made more sense for us to talk about things in terms of the 675 MHz clock and 2 cycles per warp rather than the 1350 MHz clock and 4 cycles per warp.

16 multiprocessors * 8 processors/multiprocessor * 2 flops/MAD * 1 MAD/processor-cycle * 1.35 GHz = 345.6 GFLOP/s

Alternatively:

16 multiprocessors * 8 processors/multiprocessor * 2 flops/MAD * 2 MAD/instruction-cycle * 0.675 GHz = 345.6 GFLOP/s
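
Spelled out in plain C (with the 8800 GTX numbers hard-coded purely for illustration), the same arithmetic looks like this:

```c
#include <stdio.h>

int main(void) {
    const double multiprocessors     = 16.0;  /* GeForce 8800 GTX */
    const double processors_per_mp   = 8.0;
    const double flops_per_mad       = 2.0;   /* one multiply + one add */
    const double mads_per_proc_cycle = 1.0;
    const double processor_clock_ghz = 1.35;  /* 2 x the 675 MHz instruction clock */

    double peak_gflops = multiprocessors * processors_per_mp * flops_per_mad *
                         mads_per_proc_cycle * processor_clock_ghz;
    printf("Peak MAD throughput: %.1f GFLOP/s\n", peak_gflops);  /* 345.6 */
    return 0;
}
```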

Mark

By multipumped do you mean pipelined? So the latency of a MAD is 4 processor cycles and the throughput is one per processor cycle (@ 1.35 GHz)?

This is not easily understood from the programming guide.

Perhaps multipumped was the wrong word. I just meant that each warp takes multiple cycles to process.

Mark