Some detailed questions for a bachelor's thesis

I am currently writing my bachelor's thesis, which focuses on using CUDA for parallel
computing. Besides a general description of GPGPU and the CUDA architecture,
I argue that CUDA presents the GPU not as a pure stream-processing machine
but as a more general parallel computing model. That model mainly corresponds to a
special type of APRAM in which subsets (made up of SIMD sub-subsets) can be
synchronized and have shared memory; I call it CUDAPRAM :)

I have a few questions that I would like to ask here, and I would very much
appreciate it if some of you could answer them. They appear roughly in
decreasing order of estimated answer complexity:

1.) How is NVIDIA's maximum GFLOPS figure calculated?
By trial and error I found that the formula shader clock x number of multiprocessor
cores x 3 matches the GeForce 8* series, e.g. 8800 GT: 1512 MHz * 112 * 3 FLOPS = 508 GFLOPS.
But where does the 3 come from? Something like 2 for a MADD and 1 for another
parallel ALU instruction? And why does it not need to be taken into account that each
MADD/MUL takes 4 cycles? Is there some kind of pipelining inside the ALUs? This
leads more generally to question 2:

2.) Details on the ALUs?
I cannot find good information on the ALUs (for MADD, MUL, etc.) and their connection
to the up to 16*8=128 multiprocessor cores. Is there any document on this, or can
someone give me a brief overview?

3.) Was running multiple kernels in parallel possible some time ago?
OK, the FAQ, several forum topics, and my own tests all show the same thing: “It is not
possible to run multiple kernels at once, even if they are in different streams or are
called from different processes.” But I have read elsewhere (notably in a paper
written at my university) that some time ago, probably with CUDA 0.8, it was
possible to launch multiple kernels in parallel. Is that true? And if so, what problem
led to this feature being removed?

4.) Double-precision support?
FAQ 18 says: “NVIDIA GPUs supporting double precision in hardware will become
available in late 2007.” Okay, at the moment it’s still veeeery late 2007… Is there
any newer estimate of when doubles will be available, and whether they will outperform
AMD's FireStream 9170, which is quoted at a peak of 102 GFLOPS?

5.) Texture cache size: 8 KB or 16 KB?
FAQ 14) tells us “texture cache has an effective 16KB working set size per
multiprocessor”. The CUDA Programming Guide 1.1 says “cache working set for texture
memory is 8 KB per multiprocessor”. Is that an inconsistency in the numbers, or is
something different meant by an ‘effective’ 16KB?

Thank you very much in advance !

A1: There’s been some argument about the GFLOPS in the forums, but the short version is: the NVIDIA marketing number optimistically includes FLOPS performed by the texture hardware to do linear interpolation. If you aren’t doing texture operations, the max GFLOPS is [stream processors] * [shader clock] * 2. (A MAD performs a multiply and an add at the same time.) The GFLOPS estimate does not capture the fact that the GPU also has hardware implementations of transcendental functions, like sin, cos, and exp.
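As a rough illustration of the two formulas, here is a minimal Python sketch that plugs in the 8800 GT figures from the question (1512 MHz shader clock, 112 stream processors). The function name `peak_gflops` is just a hypothetical helper for this post, and the factors 2 and 3 are the per-clock FLOP counts discussed above, not anything taken from NVIDIA documentation.

```python
def peak_gflops(shader_clock_mhz, stream_processors, flops_per_clock):
    """Theoretical peak: clock (MHz) * stream processors * FLOPs per SP per clock.

    The product is in MFLOPS, so dividing by 1000 gives GFLOPS.
    """
    return shader_clock_mhz * stream_processors * flops_per_clock / 1000.0


# GeForce 8800 GT figures quoted in the question.
mad_only = peak_gflops(1512, 112, 2)   # one MAD (multiply + add) per SP per clock
factor_3 = peak_gflops(1512, 112, 3)   # the factor 3 found by trial and error

print(f"MAD only: {mad_only:.1f} GFLOPS")
print(f"Factor 3: {factor_3:.1f} GFLOPS")
```

The factor-3 result reproduces the 508 GFLOPS from the question, while the MAD-only figure is the one to expect as an upper bound for code that does no texture work.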

And yes, the ALUs are pipelined. The size of a warp is 32, but the number of stream processors per multiprocessor is 8. This makes it sound like the 32 threads of the warp are pipelined in groups of 4 in each of the 8 stream processors. (A very nice setup, since you know you can’t have any pipeline hazards within the warp.)
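To make the "groups of 4" picture concrete, here is a small Python sketch of how a 32-thread warp could be fed through 8 stream processors in 4 passes. Note the assumption: threads are handed out in consecutive groups of 8, which is only a guess for illustration; the actual thread-to-processor mapping is not documented in this thread.

```python
WARP_SIZE = 32    # threads per warp
SPS_PER_MP = 8    # stream processors per multiprocessor


def warp_schedule(warp_size=WARP_SIZE, sps=SPS_PER_MP):
    """Return one list of thread ids per pass through the stream processors.

    ASSUMPTION: threads are issued in consecutive groups of `sps`;
    the real hardware mapping may differ.
    """
    return [list(range(start, start + sps)) for start in range(0, warp_size, sps)]


passes = warp_schedule()
for clock, threads in enumerate(passes):
    print(f"pass {clock}: threads {threads[0]}..{threads[-1]}")
```

Under this assumption one warp instruction occupies the 8 stream processors for 32 / 8 = 4 passes, which would match the 4 cycles per MADD/MUL mentioned in the question.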

A2: I’m not aware of any detailed references on the ALUs, though you could think of each stream processor as a fancy ALU which can fetch operands from a thread-selected register file. Stream processors do not have their own program counters or instruction decoders; they rely on the shared instruction decoder at the multiprocessor level. (See the figures in the programming guide.) This is why fine-grained branching has a large performance penalty.

A3: I don’t think launching multiple kernels was ever possible with CUDA. It has been possible (though I haven’t tried since 0.8) to have multiple processes issuing kernels at the same time, but the hardware interleaves their execution. As the number of multiprocessors per GPU grows with time, I hope we see this feature eventually.

A4: Still no double precision, and NVIDIA employees aren’t allowed to talk about future hardware.

A5: No idea, but I’d assume that is a bug with the FAQ.

Shader clock * number of ALUs * 2: 8800 GTX (1350e6 Hz * 128 ALUs * 2 FLOP/ALU) = 345.6 GFLOP/s

The 2 FLOP/ALU is assuming a MADD on every clock. This number differs from the “marketing GFLOPS” which includes filtering from the texture unit in some undocumented way. If you look closely at Figure 1-1 in the CUDA programming guide, it uses the 340 GFLOP number.

According to the CUDA programming guide, there are 16 multiprocessors (8800 GTX) each with 8 ALUs.

I can only confirm that as far back as CUDA 0.8, multiple kernels at a time were not possible. This was a much requested feature back then and one NVIDIA comment to that request went something like this: “it may be possible in the hardware, but the software model would require some careful rethinking to allow this”. The implication of this statement was that the CUDA model of blocks and grids would need to be changed in some fundamental way to allow multiple kernels to execute at once. To my knowledge, there has been no official comment on this matter since then.

NVIDIA doesn’t comment on future hardware, so your guess is as good as mine. My guess is that we won’t have to wait too much longer to find out, as the G80 architecture has been milked for all it’s worth all the way up to G92 and the 9800 GTX. There has to be something new around the corner… Though, I have been expecting something new around the corner since last Christmas :)

The FAQ 14) tells us "texture cache has an effective 16KB working set size per
multiprocessor" The CUDA Programming Guide 1.1 tells "cache working set for texture
memory is 8 KB per multiprocessor"

Seems like an error in the FAQ. Most questions regarding inconsistencies have usually been answered by NVIDIA with “Trust the Programming Guide”.

This made me think back to some slides I saw a long time ago, in which it looked like an MP had 16 SPs. I couldn’t find those slides anymore, but found others from NVIDIA people :…%20hardware.ppt

Slide 5 says that there is indeed 16 KB of cache per texture processing cluster, but such a cluster contains 2 multiprocessors, so on average there will be 8 KB per multiprocessor.

Thank you all for the detailed information! That clears things up completely for me. And for the “texture-size 8kB 16kB” question, that seems to be the answer, too. Great link for CUDA hardware details, btw!

I have some doubt about the formula below.

Peak throughput = Shader clock * number of ALUs * 2

For the C870, which is almost identical to the 8800 GTX except for global memory size and bandwidth:

Peak throughput = 1.35 GHz * 128 * 2 = 345.6 GFLOPS

However, NVIDIA somehow achieves 430 GFLOPS with CUDA programs. Where does the difference come from?

There are 2 SFUs in each SM. Assuming that each takes 1 instruction every cycle, the FLOPS they contribute are

16 * 2 * 1 * 1.35 * 10^9

= 43.2 GFLOPS

That still leaves a gap to 430 GFLOPS. Can anyone help, please?
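For reference, the arithmetic in this post can be written out as a short Python sketch. All inputs come from the posts above (1.35 GHz shader clock, 16 multiprocessors with 8 ALUs and 2 SFUs each, and the quoted 430 GFLOPS). The one-FLOP-per-SFU-per-clock figure is the poster's assumption, not a documented number.

```python
SHADER_CLOCK_GHZ = 1.35   # C870 / 8800 GTX shader clock
MULTIPROCESSORS = 16
ALUS_PER_MP = 8
SFUS_PER_MP = 2
QUOTED_GFLOPS = 430       # figure attributed to NVIDIA in the post

# MAD peak: every ALU retires one multiply-add (2 FLOPs) per clock.
mad_gflops = SHADER_CLOCK_GHZ * MULTIPROCESSORS * ALUS_PER_MP * 2

# ASSUMPTION (from the post): each SFU retires one 1-FLOP instruction per clock.
sfu_gflops = SHADER_CLOCK_GHZ * MULTIPROCESSORS * SFUS_PER_MP * 1

total_gflops = mad_gflops + sfu_gflops
gap_gflops = QUOTED_GFLOPS - total_gflops

print(f"MAD peak:    {mad_gflops:.1f} GFLOPS")
print(f"SFU share:   {sfu_gflops:.1f} GFLOPS")
print(f"Sum:         {total_gflops:.1f} GFLOPS")
print(f"Unexplained: {gap_gflops:.1f} GFLOPS")
```

Under these assumptions the MAD and SFU contributions together still fall short of the quoted figure, so the remainder has to come from something this simple model does not count.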
