Execution Of Thread-Blocks

Hi all,

Does that mean that blocks 0…7 have finished when blocks 8…15 are processed?

The reason for the question is: shared memory is quite small compared to the 768 MB of device memory, so I have to reuse the shared memory in my kernel over and over again. I have to make sure that data in shared memory doesn’t get overwritten while I still need it.

Is my assumption right that 8 blocks at a time can always use the complete shared memory without problems?



I just saw that every block is processed by a single multiprocessor, and every multiprocessor has its own shared memory.

So every block can use 16 KB of shared memory, regardless of how many blocks there are?

First case: 8 blocks are executed on 8 multiprocessors, so every block has its own shared memory.

Second case: bunches of 8 blocks are executed sequentially, so the shared memory can be reused because the previous bunch of 8 blocks has already finished.

Is that right?

That would also explain how compatibility between current and future devices is possible even though they have different parallel capabilities …





I don’t get it … :(

I think that each multiprocessor will take up to 8 blocks, but this is also limited by shared memory and registers. The undefined block issue order means that the driver will decide what order to run them in (there will be an algorithm that tries to maximize memory performance by scheduling blocks for long sequential reads / writes).

Once a block is loaded, it will run until it’s finished. A block will only be able to access the shared memory that was allocated to it, so there shouldn’t be any issues with it getting overwritten by another block (if it did, the GPU would be pretty much useless).

You might want to check out the CUDA occupancy calculator–it tells you how many blocks will fit on each multiprocessor for your application.
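Not from the thread, but possibly useful: besides the occupancy calculator spreadsheet, newer CUDA toolkits expose the same computation as a runtime API. A minimal sketch, assuming a hypothetical kernel `myKernel` with a 1 KB static shared array:

```cuda
#include <cstdio>

__global__ void myKernel(float *data)
{
    __shared__ float tile[256];              // 1 KB of static shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[threadIdx.x];
}

int main()
{
    int numBlocks = 0;
    // Ask the runtime how many blocks of 256 threads fit on one
    // multiprocessor, given this kernel's register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, 256 /* blockSize */, 0 /* dynamic smem */);
    printf("resident blocks per SM: %d\n", numBlocks);
    return 0;
}
```

The answer depends on the device and the kernel’s resource usage, which is exactly what the spreadsheet models by hand.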

Yes, I think you are right … I figured out that every block has its own shared memory, so I don’t need to take care about overlapping addresses and so on.

I just have to declare

extern __shared__ char foo[];

at the top of my source and use it in my kernel; every thread of a block can access the same shared memory, but threads in different blocks see different memory. And depending on how much shared memory I’m using, the hardware can run up to 8 blocks together on a single multiprocessor.

Is that right?
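As an aside (not from the posts above): with the dynamic `extern __shared__` form, the size in bytes is supplied as the third execution-configuration parameter at launch time. A sketch with hypothetical names (`myKernel`, `d_out`):

```cuda
__global__ void myKernel(float *out)
{
    extern __shared__ char buf[];            // size fixed at launch time
    float *vals = (float *)buf;              // view the raw bytes as floats
    vals[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = vals[threadIdx.x];
}

// Host side: 128 threads per block, and 128 * sizeof(float) bytes of
// dynamic shared memory allocated per block (the third <<<...>>> argument).
// myKernel<<<gridDim, 128, 128 * sizeof(float)>>>(d_out);
```

Each block gets its own copy of that allocation, which is why blocks never see each other’s shared data.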



I think the shared memory always gets declared inside the kernel:

__global__ void myKernel( ... ) {

    extern __shared__ <type> myDynamicSharedMemory[];

    __shared__ <type> myStaticSharedMemory[ <size> ];

    ...
}

From a quick search, it looks like the dwtHaar1D SDK example uses dynamic shared memory. This gets a little more complicated than the static form, since you can only declare one dynamic shared array per kernel (there’s only room for one Ns value in the execution parameters), so you have to do the addressing yourself.
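To illustrate that “do the addressing yourself” idea, here is a sketch (hypothetical kernel name, not taken from the SDK sample) that carves two logical arrays out of the single dynamic allocation:

```cuda
// Two logical arrays inside one dynamic shared-memory allocation.
// Launch with Ns = n * sizeof(float) + n * sizeof(int), e.g.
//   splitSharedKernel<<<grid, block, n * (sizeof(float) + sizeof(int))>>>(n);
__global__ void splitSharedKernel(int n)
{
    extern __shared__ char smem[];
    float *a = (float *)smem;                // first n floats
    int   *b = (int *)&a[n];                 // then n ints, right after

    if (threadIdx.x < n) {
        a[threadIdx.x] = 1.0f;
        b[threadIdx.x] = threadIdx.x;
    }
    __syncthreads();
    // ... use a[] and b[] as if they were two separate shared arrays ...
}
```

One thing to watch out for: order the sub-arrays by decreasing alignment requirement (e.g. doubles before floats before chars), so each pointer you compute is properly aligned for its type.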