GPU's memory

Could you please tell me how much memory is allocated to each of the 96 parallel processors on the NVIDIA 8800 GTS (320 MB)?

Thanks in advance.


There are 12 processors on that card, not 96, and each processor can use all of the memory, so this is not an issue at all.

Sorry for crossposting.


(there are 96 processors!)

So they may use all of the available memory (320 MB, or 320/96 MB each), right?

The 96 is really more of a marketing stunt: they multiplied the number of multiprocessors by 8 to make the number look higher than the shader count on older cards (one multiprocessor can process 8 threads at once, hence the 8).

No, it is not a marketing stunt. True, older cards had 16 “pipelines” and each pipeline had dual vec4 ALUs while new cards have 16 “multiprocessors” and each multiprocessor has 8 scalar ALUs. However, the new ALUs are independent of one another and it shows in the big efficiency improvement.

“Processor” is a complex word. In fact, doesn’t it mean nowadays “a piece of silicon”? A Core 2 Duo is one processor and an 8800 is also probably “one processor”. But the number “128” is not a marketing invention. Ironically, the number “16” is. The actual multiprocessor count is 8, and each multiprocessor has 16 ALUs, or two clusters of 8 ALUs, or something like that.

96 streaming processors is not a “marketing” number. It is the actual number of streaming processors on a Quadro 4600 or 8800GTS, each processor executing a different CUDA thread. Streaming processors are grouped into multiprocessors for the purposes of using shared memory and efficient access to global memory, among other things.

To answer the original question: the device memory (depending on the card, 320MB, 768MB, 1.5GB, and others) is not partitioned among the processors. Any CUDA thread, no matter which streaming processor is executing it, can access any device memory location. Shared memory is local to each multiprocessor (currently 16 KB per multiprocessor).
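To make that concrete, here is a minimal CUDA sketch (my own illustration; the kernel and variable names are made up, not from this thread). Global memory is one flat pool visible to every thread, while a `__shared__` array lives in the 16 KB of per-multiprocessor shared memory and is visible only within its block:

```cuda
#include <cstdio>

// Illustrative kernel: device (global) memory is one flat address space
// shared by every thread on the card; the __shared__ array below is
// per-block and lives in the 16 KB of shared memory on whichever
// multiprocessor runs the block.
__global__ void scatterGather(const float *in, float *out, int n)
{
    __shared__ float tile[256];          // 1 KB per block, out of 16 KB

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Any thread may touch any global address -- here thread tid reads
    // the element from the far end of the array, not just "its own" slot.
    tile[threadIdx.x] = in[n - 1 - tid];
    __syncthreads();                     // shared memory is only coherent
                                         // within the block after a barrier
    out[tid] = tile[threadIdx.x] * 2.0f;
}

int main()
{
    const int n = 256;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float)); // allocated from the one device
    cudaMalloc(&out, n * sizeof(float)); // memory pool -- not 320/96 per SP
    scatterGather<<<1, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The point is in the allocation: `cudaMalloc` draws from the single device memory pool, and nothing ties an allocation to a particular streaming processor.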


It is a marketing invention in the sense that you can grab any number that is a product of the number of ALUs, multiprocessors, threads, heck, even blocks per multiprocessor, and claim you have a lot of blah-blah units… while in practice it is a lot more complex and the number is pointless.

Only the multiprocessors can do something different at the same time (i.e., execute a block). That is what I call a processor, not how many operands an instruction operates on in parallel (the ALUs).

Every number that’s marketed can be called a marketing number. Every number that’s produced should have asterisks attached to it.

I mean it’s true, a multiprocessor is just a single core with a very wide SIMD unit that emulates running many threads in parallel because each virtual thread gets its own element in the vector. But come on… that’s brilliant. The gains over old, dumb SIMD are huge. Besides, thread lockstepping wasn’t invented in the G80. It has always existed in GPUs. It’s their primary characteristic.
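That lockstep behavior is easy to demonstrate (a toy kernel of my own, not from this thread): when threads in a warp diverge at a branch, the hardware runs each side with the other threads masked off, so both paths cost time.

```cuda
// Threads in one warp execute in lockstep. At the branch below, the warp
// runs the "then" side with the odd lanes masked off, then the "else"
// side with the even lanes masked off -- both paths are paid for, which
// is why divergence inside a warp hurts performance. Each virtual thread
// is, in effect, one element of a wide vector.
__global__ void divergent(float *out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = tid * 2.0f;   // even lanes active, odd lanes masked
    else
        out[tid] = tid + 100.0f; // odd lanes active, even lanes masked
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    divergent<<<1, 32>>>(d_out);   // one warp's worth of threads
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```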

I don’t know… I mean I see your point how it’s an abuse of the word “processor” (or the word “ALU” if you say “128 ALUs”). On the other hand, the element-as-a-thread idea is huge and must be communicated. Language is much more ineffective than people commonly realize (especially when you try to be concise), and I can see how you often have to bend words to get the truth across. I’m fine with the statement “128 processors”, at least much more than I am with “500 Gflops” (that’s a total lie).

Alex, any chance of explaining to me what is new about ‘the element as a thread’ idea compared with standard SIMD? You mentioned the gains over ‘old, dumb SIMD’ are huge, but I just can’t figure that one out. The way I see it, old-fashioned SIMD with flow-predication registers does more or less the same job. What am I missing?
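For what it’s worth, the two models being contrasted can be sketched side by side (a compile-only illustration of my own; all names are made up). Explicit predication computes both sides and blends by a visible mask, with one addressing pattern for the whole vector; the CUDA-style per-thread view lets each “element” branch and address memory independently, with the hardware generating the masks:

```cuda
// Old-style SIMD with predication, written out explicitly: the mask and
// the blend are visible in the code, and every element must use the same
// addressing pattern.
void simd_predicated(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; ++i) {        // stands in for one wide vector op
        bool mask = a[i] > 0.0f;         // the predication register
        float thenSide = a[i] * 2.0f;    // both sides are computed...
        float elseSide = b[i] - 1.0f;
        out[i] = mask ? thenSide : elseSide;  // ...then blended by the mask
    }
}

// CUDA-style: each "element" is written as an independent scalar thread.
// The branch and the per-thread gather (a[idx[tid]]) are ordinary scalar
// code; the hardware does the masking behind the scenes.
__global__ void simt(const float *a, const float *b, const int *idx,
                     float *out)
{
    int tid = threadIdx.x;
    if (a[tid] > 0.0f)
        out[tid] = a[idx[tid]] * 2.0f;   // per-thread addressing: a gather
    else
        out[tid] = b[tid] - 1.0f;
}
```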