Execution Of Thread-Blocks

Hi all,

Does that mean that blocks 0…7 have finished when blocks 8…15 are processed?

The reason for the question is: shared memory is quite small compared to the 768 MB of device memory, so I have to reuse the shared memory in my kernel over and over again. I have to make sure that data in shared memory doesn’t get overwritten while I still need it.

Is my assumption right that 8 blocks at a time can always use the complete shared memory without problems?



I just saw that every block is processed by a single multiprocessor, and every multiprocessor has its own shared memory.

So every block can use 16 KB of shared memory, regardless of how many blocks there are?

First case: 8 blocks are executed on 8 multiprocessors, so every block has its own shared memory.

Second case: bunches of 8 blocks are executed sequentially, so the shared memory can be reused because the previous bunch of 8 blocks has already finished.

Is that right?

That would also explain how compatibility between current and future devices is possible even though they have different parallel capabilities …





I don’t get it … :(

I think that each multiprocessor will take up to 8 blocks, but this is also limited by shared memory and registers. The undefined block issue order means that the driver will decide what order to run them in (there will be an algorithm that tries to maximize memory performance by scheduling blocks for long sequential reads / writes).

Once a block is loaded, it will run until it’s finished. A block will only be able to access the shared memory that was allocated to it, so there shouldn’t be any issues with it getting overwritten by another block (if it did, the GPU would be pretty much useless).

You might want to check out the CUDA occupancy calculator–it tells you how many blocks will fit on each multiprocessor for your application.
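Not from the thread, but possibly useful: besides the occupancy calculator spreadsheet, newer CUDA toolkits expose the same computation as a runtime API. A minimal sketch, assuming a hypothetical kernel `myKernel` with a 1 KB static shared array:

```cuda
#include <cstdio>

__global__ void myKernel(float *data)
{
    __shared__ float tile[256];              // 1 KB of static shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[threadIdx.x];
}

int main()
{
    int numBlocks = 0;
    // Ask the runtime how many blocks of 256 threads fit on one
    // multiprocessor, given this kernel's register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, 256 /* blockSize */, 0 /* dynamic smem */);
    printf("resident blocks per SM: %d\n", numBlocks);
    return 0;
}
```

The answer depends on the device and the kernel’s resource usage, which is exactly what the spreadsheet models by hand.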

Yes, I think you are right … I figured out that every block has its own shared memory, so I don’t need to take care about overlapping addresses and so on.

I just have to declare

extern __shared__ char foo[];

at the top of my source and use it in my kernel; every thread of a block can access the same shared memory, but threads in different blocks see different memory. And depending on how much shared memory I’m using, the hardware can run up to 8 blocks together on a single multiprocessor.

Is that right?
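As an aside (not from the posts above): with the dynamic `extern __shared__` form, the size in bytes is supplied as the third execution-configuration parameter at launch time. A sketch with hypothetical names (`myKernel`, `d_out`):

```cuda
__global__ void myKernel(float *out)
{
    extern __shared__ char buf[];            // size fixed at launch time
    float *vals = (float *)buf;              // view the raw bytes as floats
    vals[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = vals[threadIdx.x];
}

// Host side: 128 threads per block, and 128 * sizeof(float) bytes of
// dynamic shared memory allocated per block (the third <<<...>>> argument).
// myKernel<<<gridDim, 128, 128 * sizeof(float)>>>(d_out);
```

Each block gets its own copy of that allocation, which is why blocks never see each other’s shared data.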



I think the shared memory always gets declared inside the kernel:

__global__ void myKernel( ... ) {

    extern __shared__ <type> myDynamicSharedMemory[];

    __shared__ <type> myStaticSharedMemory[ <size> ];

    ...
}

From a quick search, it looks like the dwtHaar1D SDK example uses dynamic shared memory. This gets a little more complicated than the static form, since you can only declare one dynamic shared array per kernel (there’s only room for one Ns value in the execution parameters), so you have to do the addressing yourself.
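To illustrate that “do the addressing yourself” idea, here is a sketch (hypothetical kernel name, not taken from the SDK sample) that carves two logical arrays out of the single dynamic allocation:

```cuda
// Two logical arrays inside one dynamic shared-memory allocation.
// Launch with Ns = n * sizeof(float) + n * sizeof(int), e.g.
//   splitSharedKernel<<<grid, block, n * (sizeof(float) + sizeof(int))>>>(n);
__global__ void splitSharedKernel(int n)
{
    extern __shared__ char smem[];
    float *a = (float *)smem;                // first n floats
    int   *b = (int *)&a[n];                 // then n ints, right after

    if (threadIdx.x < n) {
        a[threadIdx.x] = 1.0f;
        b[threadIdx.x] = threadIdx.x;
    }
    __syncthreads();
    // ... use a[] and b[] as if they were two separate shared arrays ...
}
```

One thing to watch out for: order the sub-arrays by decreasing alignment requirement (e.g. doubles before floats before chars), so each pointer you compute is properly aligned for its type.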