Shared memory: is it context switched?

I am somewhat confused about the CUDA architecture. If a thread/warp/block needs a slow memory access from device memory, the latency can be masked by executing threads from another block. This requires pre-emption of the current warp. But what about the shared memory? Is it part of the block context and pre-empted at the same time?

– Kuisma

Shared memory is divided between blocks, just as the registers are. If there is not enough shared memory (or registers) for an extra block at the same time on the same multiprocessor, it won’t be scheduled for concurrent execution. These are known as “split resources” in simultaneous multithreading.
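For example (a minimal sketch; the kernel name and sizes are made up for the illustration): every statically declared `__shared__` array is allocated once per resident block, so its size counts against the multiprocessor’s shared memory budget for every block that is resident at the same time.

```cpp
// Illustrative kernel (name and sizes are arbitrary).  Each block that is
// resident on a multiprocessor gets its own private copy of `tile`, so this
// kernel costs 256 * 4 = 1 KB of shared memory per resident block.  With a
// 16 KB shared memory budget per multiprocessor (a G80-era figure), shared
// memory alone would allow 16 resident blocks; register usage and the
// hardware block/warp limits can reduce that number further.
__global__ void scale(float *data, float factor)
{
    __shared__ float tile[256];               // 1 KB per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];              // assumes blockDim.x == 256
    __syncthreads();                          // and that the launch covers the array
    data[i] = tile[threadIdx.x] * factor;
}
```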

But wouldn’t this require offline scheduling, i.e. a precalculated schedule? With non-deterministic memory accesses causing preemption, that would not be possible…?

  1. That scheduling can easily be done offline. Based on the register usage and shared memory usage, you (or the driver) can easily determine how many BLOCKS can run concurrently on a single multiprocessor. This number is constant because those resource usages are constant for each block. See the formulas in the programming guide, or download the occupancy calculator to play with it (there is also a small sketch of the calculation after this list).

  2. At the warp level, you need to imagine that after every single INSTRUCTION executed, another warp is likely to be swapped in. The latency is much, much lower with shared memory than with global memory, but the same interleaving occurs as a result. Heck, there is even interleaving at the register level, since values written to registers cannot be read for a few clocks (register read-after-write dependencies).
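As a concrete illustration of point 1 (this is my own sketch, not the exact formula from the guide: the limits below are rough G80-era figures, and I’m ignoring the allocation granularities that the real occupancy calculator accounts for):

```cpp
#include <stdio.h>

/* Illustrative per-multiprocessor resource limits, roughly G80-era figures.
 * The real values and the rounding granularities are in the programming
 * guide and the occupancy calculator spreadsheet. */
#define REGS_PER_SM        8192
#define SMEM_PER_SM        16384   /* bytes */
#define MAX_BLOCKS_PER_SM  8
#define MAX_WARPS_PER_SM   24      /* 768 threads */

static int min_int(int a, int b) { return a < b ? a : b; }

/* How many blocks of a given kernel can be resident on one multiprocessor
 * at once.  The answer depends only on per-block resource usage, which is
 * why the driver can work it out before the kernel ever runs. */
int blocks_per_sm(int regs_per_thread, int smem_per_block, int threads_per_block)
{
    int warps_per_block = (threads_per_block + 31) / 32;
    int by_regs  = REGS_PER_SM / (regs_per_thread * threads_per_block);
    int by_smem  = smem_per_block ? SMEM_PER_SM / smem_per_block : MAX_BLOCKS_PER_SM;
    int by_warps = MAX_WARPS_PER_SM / warps_per_block;

    return min_int(min_int(by_regs, by_smem),
                   min_int(by_warps, MAX_BLOCKS_PER_SM));
}

int main(void)
{
    /* e.g. 10 registers/thread, 1 KB shared memory/block, 256 threads/block:
     * both the register and warp limits cap this at 3 resident blocks. */
    printf("%d blocks per SM\n", blocks_per_sm(10, 1024, 256));
    return 0;
}
```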

There is no preemption in the same way there is preemption in multitasking environments. There is no context switch. It’s more as if the hardware maintains several execution contexts simultaneously (in fact, since the resources are split, it maintains all the execution contexts; r0 is not the same physical register for thread 0 as for thread 1).

So the driver does static scheduling based on how many blocks can run at the same time on the GPU. As long as there are enough resources (registers + shared memory + “warp slots” + “block slots”), it just keeps adding blocks to the schedule. The time is shared between threads in a fine-grained multithreading fashion (you can easily find “understandable” papers about coarse-grained, fine-grained and simultaneous multithreading, so you can better understand how time is shared on GPUs).
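To make the “no context switch” point concrete, here is a hypothetical kernel (the name and the saxpy-like arithmetic are just placeholders); the comment describes what the hardware does while a warp waits on its global load:

```cpp
// Hypothetical kernel, just to make the interleaving concrete.  Each warp
// issues a long-latency global read and then some arithmetic.  Nothing is
// saved or restored when the scheduler moves to another warp: every
// resident warp's registers stay live in the register file, so the
// hardware simply issues the next ready warp's instruction while this
// warp waits for its load to return.
__global__ void saxpy_like(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];       // warp stalls here on global memory latency
        y[i] = a * v + y[i];  // issued once the value arrives
    }
}
```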

I don’t know if I’m answering your question. I think you are misunderstanding how multithreading is actually performed by the hardware… but maybe I’m wrong. :blink:

I think I know what’s confusing me. :)

I must have been wrong here. Memory reads can only be masked by the execution of threads already scheduled on the current multiprocessor, that is, very likely threads of the same block, right?

I got the (incorrect) impression that more blocks in the grid would help get better performance during device memory accesses, and that this would have required some kind of inter-block scheduler, context switches, etc. But blocks are more or less batch scheduled? Once one or more are assigned to a multiprocessor, they will use it exclusively until they terminate?

Yes, you are right. I need to play around more.

Right.

Given the kernel properties (number of registers used, amount of shared memory used), it is determined how many blocks can execute on one multiprocessor at once, which is up to 8. This number is static during the entire kernel execution, so the hardware knows that when a block ends, it can start a new one.

So more blocks in a grid can get you better performance, if the kernel properties allow more blocks to be executed on a multiprocessor in parallel.
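If you want to check this static per-kernel number programmatically rather than with the spreadsheet, more recent CUDA toolkits expose a runtime query for it (this wasn’t available in early releases); a minimal sketch, where the kernel and the block size of 256 are just placeholders:

```cpp
#include <cstdio>

// Placeholder kernel: 1 KB of static shared memory, meant to be launched
// with 256 threads per block.
__global__ void my_kernel(float *data)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[threadIdx.x] * 2.0f;
}

int main()
{
    // Ask the runtime how many blocks of my_kernel (256 threads, no dynamic
    // shared memory) can be resident on one multiprocessor.  The answer is
    // fixed for the whole kernel launch, exactly as described above; the
    // hard cap of 8 blocks applies to the hardware of that era, and the
    // query reflects whatever the actual device limits are.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  my_kernel, 256, 0);
    printf("%d blocks per multiprocessor\n", blocks_per_sm);
    return 0;
}
```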

Ok, I think I’m enlightened. Thank you folks. :)

Wumpus, so if I have 9 thread blocks overall for each multiprocessor to execute, but the shared memory usage allows only 5 thread blocks to run concurrently on one multiprocessor, then:

Question 1:

There would be no thread-block swapping and hence no context switching during the entire run time of the 5 thread blocks. Only when they have all terminated would the other 4 thread blocks get scheduled onto the multiprocessor. Is that correct?

Question 2:

What if one of the 5 thread blocks running on the multiprocessor terminates earlier than the others (for some reason)? Would it be possible for one of the remaining 4 thread blocks to get scheduled onto the multiprocessor as soon as one of the 5 running thread blocks terminates?

Many thanks,

Timtimac

I’m not sure what happens if some blocks take less time to compute than others, but I think it means that another block will get executed in its place. It would be wasteful if the block had to wait for all the others to complete. Someone from NVIDIA would have to comment on this, though.