Hey guys, just wondering about CUDA architecture… I was reading the Larrabee paper and saw some things in it that made me realize a few things about CUDA. It's kind of speculation, maybe it's a known fact, but it makes perfect sense…

Well, first (I've been known to be quite naive) I thought something like "WOW! My 9800 GTX has 16 OCTA-CORES?? THAT'S WHY I CAN RUN 128 THREADS!?" Then I started to mature the idea with several variations like "Maybe it's something like Intel's Hyper-Threading? 16 cores with 8 ALUs each?" and so on. I just didn't get how I could run 128 independent threads =p Notice I'm very much a noob at CUDA programming, but still, I was more fascinated by the architecture than by what I could do with it =p

After reading the Larrabee paper I started to notice some things. They state that with their 16-wide VPU (vector processing unit, 16 32-bit lanes) they could run 4 threads per processor. I was like "No you can't… unless you have additional ALUs/cores you can't… AH! OF COURSE!" Well… I realized that they were talking about SIMD under the name THREAD. I learned x86 ASM by playing with the SSE instructions. For those who don't know, SSE is an instruction set that works with 128-bit (16-byte) registers by operating on 4-byte (32-bit) parts of the registers, so it's like:

If you have
Register1 = 10,20,30,40
Register2 = 40,30,20,10

You can (in one instruction) add all 4 elements independently

Register1 = (10+40),(20+30),(30+20),(40+10)

So you can work on those registers in parallel, doing whatever you want: if you have 4 pieces of data that all need the exact same operations (like 4 RGBA pixels of an image), you can process them in parallel:

ADD register1,register2
SUB register1,register2
CMP register1,register2
MUL register1,register2

And so on… Probably to take advantage of the existing architecture, AMD's 3DNow! and Intel's MMX reuse the FPU's registers (instead of creating new ones), so I'm pretty sure you won't be able to run 4 "threads" using the SSE registers on Larrabee =p

Well, in some parts of the CUDA documentation there are things that don't make much sense (if you really had 128 processors), like: why can't I put one thread of a warp on hold and wait for the others to complete? Why is a barrier the only way to sync threads? If threads branch they start to execute serially, and when they all come back to the same place they work in parallel again, right? Well, why is that?

Then it came to me: well… what I have here is a 4-core multiprocessor working with LARGE SIMD registers… It makes sense, doesn't it? Shaders/CUDA code usually run the exact same thing, so it would be easy to just use a BIG SIMD register and process it in parallel.

So you have a register whose size = warp size, and a set of instructions to loop/test/add/sub/etc. on each of the individual elements of this big SIMD register in parallel. It's no big deal, but look at this classic image from the CUDA documentation:

It says you have MULTIPROCESSORS with PROCESSORS inside, when actually I think you (probably) have a quad core (depending on the model) with big SIMD registers in place of the cores… That's why you can only sync with a barrier, that's why you can't branch while keeping the parallelism, and so on. Larrabee will do the exact same thing (except with more cores and smaller SIMD registers), so the idea of SIMT is actually an abstraction over SIMD. It's a great architecture for sure; abstracting SIMD registers as threads is a BIG DEAL for us programmers. Working with 128-byte registers (32 warp size × 32-bit elements = 128 bytes) would be complicated without CUDA's abstraction. I just think it's not what they state in the documentation.

Well… that's it! Is this news? Has anyone come up with a "theory" like this before? It's quite obvious, I know, but I was quite "scared" when I came up with it =p


Seems so. SIMT = SIMD + mask.
Though we don't know exactly how NVIDIA implements SIMT, it looks pretty much the same as Larrabee.
But who cares, as long as CUDA lets us see the GPU as SIMT, and it works well.

Yeah, the idea is the same, though Larrabee is supposed to have more "real" cores (allowing more branching with no loss of performance).

Sure, it's just that the documentation says differently: it explicitly shows a diagram stating that you have N multiprocessors with M cores each, and from what I've seen that's not what happens. This brings up some other things. Technically, if it's really SIMD, the barrier instruction is useless: there would NEVER be any synchronization issue, since the threads are in fact executing the same instruction and each operation completes on the EXACT same clock. Even for branches, the CUDA documentation states that the threads run serially until they return to common code. Another thing that comes to mind is the atomic operations, which don't follow the previous idea. Some pros and cons for the SIMT == SIMD idea.

There is more to "SIMT" than to "SIMD". The SIMT model provides the capability to switch between threads quickly, which is called SMT. Traditional SIMD, like the SSE units in Core 2 CPUs, has no such thing. The execution of one instruction on the MP is SIMD, but the MP also provides SMT, so the MPs are really SMT + SIMD. I don't know whether LRB will provide SMT, but I guess not. Anyway, the LRB cores are basically original Pentium (P54C) cores.

The barrier can be used to synchronize the whole block, not just the eight threads executing together on the MP, so it's not as useless as you think.

Another thing to keep in mind is that these architectures oversubscribe the processors with threads to keep the pipeline full. Each stream processor in a CUDA chip can finish an instruction every shader clock cycle (usually somewhere from 1 to 1.5 GHz). However, it can only achieve that throughput at those high clock rates with a very deep pipeline. An easy way to do that is to make the warp size 32, even though the number of stream processors per multiprocessor is only 8: four threads are queued up in each pipeline. Even then, you can still have read-after-write pipeline hazards unless you have 192 threads per block, which means you've queued up 24 identical threads per stream processor in the pipeline.

This is one reason why CUDA devices have hundreds of “processors” but reach maximum efficiency running thousands of threads.