Interaction between CUDA and GPU Architecture

Hi all,

I’m new to this forum, so just in case anyone’s curious: I’m an applied math student just learning about parallel computing and its applications to scientific computing, so if anyone has advice, info, or links to share, that would be great! I’m writing a very basic dense matrix multiply function, and I need help tuning it.

My question is this:
When you call a kernel, you put a snippet of code looking like <<<nBlocks, blockSize>>> in the function call, so that the kernel is executed as a grid of blocks, each containing some number of threads. I’m familiar with only the most basic GPU architecture, i.e. a card has ~10-20 multiprocessors/SMXs (I have a Kepler card), each with its own on-chip memory, with each SMX containing a number of cores, and each core capable of running a single sequential thread. So how does CUDA divide the blocks among the SMXs (assuming I’m only doing flops)? And if I’m misunderstanding GPU architecture, please let me know!
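For concreteness, here is a minimal sketch of how that launch configuration maps threads onto the matrix (assuming square N x N matrices in row-major order; the kernel and names are illustrative, not a tuned implementation):

```cuda
// Naive dense matrix multiply: C = A * B, square N x N, row-major.
// Each thread computes one element of C. The hardware scheduler hands
// whole blocks to SMXs as their resources (registers, shared memory,
// resident-block limits) allow; you don't control that mapping directly.
__global__ void matMul(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Launch: a 2D grid of 16x16-thread blocks covering the N x N output.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matMul<<<grid, block>>>(dA, dB, dC, N);
```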

Also, how exactly does data movement work? I move my data from the CPU’s cache/DRAM/hard drive to the GPU’s VRAM, then things get divided among each SMX’s shared memory, and from there into registers. Is that correct?
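In code, the host-to-VRAM copy is explicit, and the VRAM-to-shared-memory staging is something you write yourself inside the kernel; scalar locals then live in registers automatically. A hedged sketch (names like dA, hA, and TILE are mine):

```cuda
#define TILE 16

// Each block stages a TILE x TILE piece of the matrix from VRAM (global
// memory) into its SMX's shared memory before computing on it.
__global__ void useShared(const float *dA, int N)
{
    __shared__ float tile[TILE][TILE];      // per-block, on-chip shared memory
    int i = blockIdx.y * TILE + threadIdx.y;
    int j = blockIdx.x * TILE + threadIdx.x;
    if (i < N && j < N)
        tile[threadIdx.y][threadIdx.x] = dA[i * N + j];  // VRAM -> shared
    __syncthreads();                        // wait until the whole tile is loaded
    // ... compute on tile[][]; scalar variables are placed in registers ...
}

// Host side: allocate VRAM and copy from host memory explicitly.
// cudaMalloc(&dA, bytes);
// cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
```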


Perhaps you confused “blocks” with “memory blocks”. What CUDA means by “blocks” is a subdivision of the computation/threads into “thread blocks”.

So it’s more like a computational block. Having said that, the way the graphics card works is basically the same as a main RAM <-> CPU system.

It simply fetches the data from its main memory/VRAM when your kernels/threads ask for it. It does so a cache line at a time, i.e. a fixed number of bytes per memory transaction; on Kepler that’s a 128-byte cache line (consult the CUDA C Programming Guide for the exact rules). The GPU does have a little bit of cache, but it usually won’t be sufficient for big data sets and the huge number of threads on bigger GPUs.
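This is why access pattern matters so much: when adjacent threads of a warp read adjacent addresses, the hardware can combine their requests into a single cache-line-sized transaction. A sketch of the two extremes (kernel names are mine):

```cuda
// Coalesced: consecutive threads touch consecutive floats, so a warp's
// 32 reads fall in one contiguous 128-byte segment -> few transactions.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: each thread in a warp hits a different cache line, so the
// same amount of useful data costs many more memory transactions.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}
```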

Basically, assume the VRAM is one big linear piece of memory that your kernels can request addresses from. I’m not sure, but I’d guess the SMXs can be bottlenecked by the memory bus, since they all share it, so their memory requests are queued and eventually serviced. The GPU has thread-switching capability: it can switch to other threads while older threads are parked on-chip waiting for their memory requests to be serviced (these are stalled threads waiting for memory, for example).

Anyway… I just had an idea which might increase the number of memory lookups that can be done. Instead of buying one expensive/hot NVIDIA graphics card, buy multiple cheaper/cooler ones; then each card has its own VRAM memory bus, and perhaps this gives higher aggregate memory bandwidth. Note that CUDA doesn’t use SLI for compute: each card shows up as a separate device, and you partition the work across them yourself. Could be interesting to compare such a setup with a single card :)
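A minimal sketch of what that multi-card setup looks like on the host side (the function name is mine; the per-device work is only indicated in comments):

```cuda
#include <cstdio>

// CUDA sees every GPU as a separate device with its own VRAM and its own
// memory bus; there is no pooled memory for compute. You enumerate the
// cards and explicitly target each one with its slice of the data.
void launchOnAllDevices()
{
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);   // subsequent allocations/copies/kernels go to card d
        // cudaMalloc(...); cudaMemcpyAsync(...); kernel<<<grid, block>>>(...);
        printf("device %d selected\n", d);
    }
}
```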