I am currently writing my bachelor's thesis. It focuses on the use of CUDA for parallel computing. Besides a general description of GPGPU and the CUDA architecture, I classify CUDA's view of the GPU not as a pure stream-processing machine, but as a more general parallel computing model: essentially a special type of APRAM whose subsets (of SIMD sub-subsets) can be synchronized and share memory, which I call CUDAPRAM :)
Along the way I have run into some questions that I would like to ask here, and I would very much appreciate it if some of you could answer a few of them. They appear roughly in decreasing order of estimated answer complexity:
1.) How is NVIDIA's maximum GFLOPS figure calculated?
By trial and error I found that the formula shader clock x number of multiprocessor cores x 3 matches the GeForce 8 series, e.g. for the 8800 GT: 1512 MHz * 112 * 3 FLOPs = 508 GFLOPS.
But where does the 3 come from? Something like 2 FLOPs for a MADD plus 1 for another ALU instruction issued in parallel? And why does it not need to be taken into account that each MADD/MUL takes 4 cycles? Is there some kind of pipelining inside the ALUs? This leads more generally to question 2:
2.) Details on the ALUs?
I cannot find good information on the ALUs (for MADD, MUL, etc.) and how they relate to the up to 16 * 8 = 128 multiprocessor cores. Is there a document on this, or can someone give me a brief overview?
3.) Was running multiple kernels in parallel possible some time ago?
OK, the FAQ, several forum topics, and my own tests all confirm: "It is not possible to run multiple kernels at once, even if they are in different streams or are called from different processes." But elsewhere I read (in particular in a paper written at my university) that some time ago, probably with CUDA 0.8, it was possible to launch multiple kernels in parallel. Is that true? And what conflict led to this feature being removed?
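For reference, my own test was roughly the following sketch. The kernel `busy` and all sizes are made up; the point is only that two launches into different streams still execute one after the other on the device:

```cuda
#include <cuda_runtime.h>

/* Dummy kernel that just burns some cycles. */
__global__ void busy(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 1.0001f + 0.5f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* Two launches into two different streams; timing the region
     * shows they are serialized on the device anyway. */
    busy<<<n / 256, 256, 0, s0>>>(d_a, n);
    busy<<<n / 256, 256, 0, s1>>>(d_b, n);
    cudaThreadSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```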
4.) double support?
FAQ 18 says: "NVIDIA GPUs supporting double precision in hardware will become available in late 2007." Okay, at the moment it's still veeeery late 2007… Is there a new estimate for when doubles will be available, and will they outperform AMD's FireStream 9170, which is quoted with a peak of 102 GFLOPS?
5.) texture cache size: 8 KB or 16 KB?
FAQ 14 tells us the "texture cache has an effective 16KB working set size per multiprocessor", while the CUDA Programming Guide 1.1 says the "cache working set for texture memory is 8 KB per multiprocessor". Is this an inconsistency in the numbers, or does 'effective' 16 KB mean something different?
Thank you very much in advance!