I launch a kernel with blockNum = 300 and Ns = 1 KB in the <<<…>>> configuration. That sums to 300 * 1 KB = 300 KB > sharedMemorySize = 256 KB.
But in emulation mode (I have no real card at the moment), it produces correct results.
1. Is this also valid in device mode?
2. If yes, why? Is there a swap mechanism that moves the shared memory of idle (swapped-out) blocks into global memory, like a PCB context switch? Isn't that costly? Thanks!
I don’t think I understand your problem.
If Ns = 1 KB, then you will use exactly 1 KB of shared memory (assuming you have no additional static allocations or excessive register usage). This is always fine.
The 300 blocks will run sequentially (worst case) or in batches, depending on how many instances fit into the resources of your card. All of them use the same 1 KB of shared memory. That is also fine.
Since there are 16 * 16 KB = 256 KB of shared memory in total, I wrongly assumed that each block statically consumes 1 KB of that shared memory space, so 300 blocks would need more than 256 KB, which is impossible and surprised me.
Now I guess that when a block is swapped out (no longer active on a multiprocessor, i.e. no longer an instance), its variables in shared memory are also swapped into global memory to make room for other active instances. Hence, no matter how many blocks I use, I can set the kernel's Ns to nearly 256 KB without computing the division 256 KB / blockNum.
Ns = 250 KB enables only 1 block at a time, so a lot of time is wasted swapping shared-memory variables;
Ns = 120 KB enables 2 blocks at a time, which also needs a lot of swap time;
Ns = 1 KB with blockNum = 250k means no time is wasted swapping shared-memory variables.
Is that right?
You mentioned an interesting point: if I use excessive registers, does CUDA automatically (silently) spill the overflowing register variables into shared memory?
Thanks a lot!
When a block becomes active on a multiprocessor, it stays active there until it completes. The shared memory is never swapped in or out automatically – it is explicitly user-managed memory.
The number of blocks that can be active on a multiprocessor is limited by the shared memory usage (and register usage) of each block. So if each block uses 8KB of shared memory, you can have at most 2 active at a time on a multiprocessor. If each uses 2KB, you can have at most 8.
You might ask… If each block runs to completion before any new block takes its place on the multiprocessor, why would you want multiple blocks active per multiprocessor? The answer is that when you have a __syncthreads() after a load from global memory, all threads in the block must wait for all loads before the sync to complete. During this time, if other blocks are active on the multiprocessor and have threads that are not also waiting at a sync, then warps of threads from one of those blocks can run while the first block is syncing. Thus, having multiple blocks active per multiprocessor helps hide memory latency.
I think you misunderstand the idea of the shared memory. You cannot set Ns to be larger than 16k because this is the max for the 8800GTX. The shared memory content is simply discarded when the block finishes. There is no automatic read/write to device memory.
The number of blocks you are using is independent of the Ns setting. The block dim3 vector uses unsigned shorts, so you should be able to launch 64k x 64k x 64k blocks :ph34r:
The shared memory usage only rules how many blocks can be run simultaneously by one multiprocessor (for hiding latency and thread sync).
I got it, thank you very much!
I got it, thank you very much. I really want to try 64k x 64k x 64k blocks :)
The grid of blocks is only 2D, not 3D (but thread blocks themselves can be 3D), so you can’t actually launch 64k x 64k x 64k blocks. The most you can do is 64k x 64k blocks.
Right, sorry got carried away by the big numbers… :mellow:
It’s still big – 2^41 threads per kernel invocation max! Tera-thread computing. :)