Questions about shared memory and branching

Hi,
I am in the process of porting some C code to CUDA. I have read the programming guide, but I have a couple of questions before I start:

1 - In the guide’s example on using shared memory as a means of accelerating a matrix multiplication kernel, a two-dimensional shared memory array is allocated with the __shared__ keyword. But since every thread executes that kernel, wouldn’t the allocation of the array fail at some stage, since shared memory is very small? Or does an array declared as __shared__ behave like a C array declared as static?

2 - On the subject of using flow control statements, the manual states that such constructs should be avoided (and I understand why), and that branch predication can be used to solve the problem of diverging threads within the same block. However, the guide states that there is a limit to how often branch predication will be used. Why?
I will be working with very large matrices and will still need to make sure that threads don’t write out of bounds; presumably an if statement will do the trick. But how can I make the compiler use branch predication beyond the threshold given in the manual?
I have also thought about padding the matrix “borders” so that out-of-bounds threads don’t actually do any damage; however, this wastes (precious) space, and if the matrix I am working on is a sub-matrix of a larger one, this method won’t work.

Shared memory is shared by the threads of a thread block (a thread block is executed on one multiprocessor).

In programming guide 2.3, page 73: each multiprocessor has on-chip memory of the four following types:

  1. One set of local 32-bit registers per processor

  2. A parallel data cache or shared memory that is shared by all scalar processor cores and is where the shared memory space resides

The number of blocks a multiprocessor can process at once – referred to as the number of active blocks per multiprocessor – depends on how many registers per thread and how much shared memory per block are required for a given kernel, since the multiprocessor’s registers and shared memory are split among all the threads of the active blocks.

Page 107: the __shared__ qualifier, optionally used together with __device__, declares a variable that:

  1. Resides in the shared memory space of a thread block,

  2. Has the lifetime of the block,

  3. Is only accessible from all the threads within the block.
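In other words, a __shared__ array is allocated once per thread block and is visible to all threads of that block; it is not allocated separately by every thread, so it behaves much like the static analogy in your question. A minimal sketch (the kernel name is purely illustrative, and the block is assumed to be launched with 256 threads):

[codebox]// Minimal sketch: one shared array per block versus one local variable per thread.
// Assumes a launch such as sharedDemo<<<grid, 256>>>(out).
__global__ void sharedDemo(float *out)
{
    __shared__ float buf[256];      // ONE 256-float array per block, not per thread

    float x = (float)threadIdx.x;   // private to each thread (normally lives in a register)

    buf[threadIdx.x] = x;           // every thread writes its own slot of the SAME array
    __syncthreads();                // wait until the whole block has written

    // read a neighbour's slot to show that all threads really see the same array
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[(threadIdx.x + 1) % 256];
}[/codebox]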

Question 1: since every thread executes that kernel, wouldn’t it fail to allocate the array at some stage since shared memory is very small?

I'll take the matrix multiplication example on pages 23~25 of programming guide 2.3 to explain.

[codebox]#define BLOCK_SIZE 16

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    ....
    // Shared memory used to store Asub and Bsub respectively
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
}

void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    ....
    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
}[/codebox]

I have a Tesla C1060, which has compute capability 1.3 and the following properties:

  1. The number of registers per multiprocessor is 16384

  2. The maximum number of active blocks per multiprocessor is 8;

  3. The maximum number of active warps per multiprocessor is 32

  4. The maximum number of active threads per multiprocessor is 1024

  5. The amount of shared memory available per multiprocessor is 16 KB

Remark: property 3 is redundant, since from property 4: 1024 threads / 32 (threads per warp) = 32 maximum active warps.

In the matrix multiplication example there are two float matrices As and Bs declared in shared memory, each with 16 x 16 = 256 float elements, so the total size = 256 x 4 (4 bytes per float) x 2 (two matrices) = 2 KB.

Number of threads per block = 16 x 16 x 1 = 256 (from the code “dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE)”).

“The maximum number of active threads per multiprocessor is 1024” implies: maximum number of active blocks = 1024 / 256 = 4; this number also satisfies property 2 (“The maximum number of active blocks per multiprocessor is 8”).

Hence one multiprocessor has 4 active blocks, each block needs 2 KB of shared memory, so the total required shared memory = 2 KB x 4 = 8 KB; this number satisfies property 5 (“The amount of shared memory available per multiprocessor is 16 KB”).
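If you want to reproduce this arithmetic programmatically, here is a rough host-side sketch of the same calculation (register pressure is ignored here, because the per-thread register count of the kernel is not given in the example):

[codebox]// Rough sketch of the occupancy arithmetic above, using the compute capability 1.3 limits.
#include <stdio.h>

int main(void)
{
    const int maxBlocksPerSM  = 8;           // property 2
    const int maxThreadsPerSM = 1024;        // property 4
    const int sharedPerSM     = 16 * 1024;   // property 5: 16 KB

    const int threadsPerBlock = 16 * 16;                   // dimBlock(16, 16)
    const int sharedPerBlock  = 2 * 16 * 16 * 4;           // As + Bs, 4 bytes per float = 2 KB

    int byThreads = maxThreadsPerSM / threadsPerBlock;     // 1024 / 256  = 4
    int byShared  = sharedPerSM / sharedPerBlock;          // 16384 / 2048 = 8

    // the smallest of the three limits decides the number of active blocks
    int activeBlocks = maxBlocksPerSM;
    if (byThreads < activeBlocks) activeBlocks = byThreads;
    if (byShared  < activeBlocks) activeBlocks = byShared;

    printf("active blocks per multiprocessor: %d\n", activeBlocks);          // prints 4
    printf("shared memory actually used: %d KB\n",
           activeBlocks * sharedPerBlock / 1024);                            // prints 8
    return 0;
}[/codebox]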

Question 2: On the subject of using flow control statements, the manual states that such constructs should be avoided (and I understand why), and that branch predication can be used to solve the problem of diverging threads within the same block.

In fact, divergence happens only inside a warp, not across a whole block:

"Any flow control instruction (if, switch, do, for, while) can significantly

impact the effective instruction throughput by causing threads of the same warp to

diverge, that is, to follow different execution paths. If this happens, the different

executions paths have to be serialized, increasing the total number of instructions

executed for this warp. When all the different execution paths have completed, the

threads converge back to the same execution path"
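So the key point is warp granularity: a branch only costs you serialization when threads of the same warp take different paths. A small sketch of the difference (kernel name and data are purely illustrative; the usual warp size of 32 is assumed):

[codebox]__global__ void divergenceDemo(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Diverges INSIDE every warp: odd and even lanes take different paths,
    // so the two paths are serialized for each warp.
    if (threadIdx.x % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Branches at warp granularity: all 32 threads of a warp take the same
    // path, so there is no serialization even though the block as a whole diverges.
    if ((threadIdx.x / 32) % 2 == 0)
        data[tid] -= 3.0f;
    else
        data[tid] *= 0.5f;
}[/codebox]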

Question 3: I will be working with very large matrices and will still need to make sure that threads don’t write out of bounds.

All you have to do is set boundary conditions. For example, suppose you want to access a matrix A with dimensions n1 x n2, say A(1:n1, 1:n2); then the kernel function may be

[codebox]__global__ void foo( float *A, int n1, int n2)
{
    // BLOCK_DIM is the tile size (e.g. 16), matching the block dimensions used at launch
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    unsigned int index_in; // index of A

    // (yIndex, xIndex) = (i-1, j-1)
    // require 1 <= i <= n1 and 1 <= j <= n2
    if ( (xIndex < n2) && (yIndex < n1) ){
        // index (i,j) of A(1:n1, 1:n2) is mapped to (i-1)*n2 + (j-1)
        index_in = yIndex * n2 + xIndex;
        // process A(i,j) = A[index_in]
    }
}[/codebox]
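A possible matching launch might look like the following (BLOCK_DIM is assumed to be 16 here): the ceiling division makes the grid large enough to cover all of A(1:n1, 1:n2), and the boundary test inside foo masks off the extra threads of the last, partially filled blocks, so no padding of the matrix is needed:

[codebox]// Possible matching launch; assumes the kernel foo from the codebox above is in scope.
#define BLOCK_DIM 16

void launch_foo(float *d_A, int n1, int n2)
{
    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    // ceiling division: enough blocks to cover every element of A(1:n1, 1:n2)
    dim3 dimGrid((n2 + BLOCK_DIM - 1) / BLOCK_DIM,    // x direction covers the n2 columns
                 (n1 + BLOCK_DIM - 1) / BLOCK_DIM);   // y direction covers the n1 rows
    foo<<<dimGrid, dimBlock>>>(d_A, n1, n2);
}[/codebox]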

Thank you LSChien, that was a very enlightening post.