Questions about shared memory and branching

Hi,
I am in the process of porting some C code to CUDA. I have read the programming guide, but I have a couple of questions before I start:

1 - In the guide’s example on using shared memory as a means of accelerating a matrix multiplication kernel, a two-dimensional shared memory array is allocated with the __shared__ keyword. But since every thread executes that kernel, wouldn’t the allocation of the array fail at some stage, since shared memory is very small? Or does an array declared as __shared__ behave like a C array declared as static?

2 - On the subject of using flow control statements, the manual states that such constructs should be avoided (and I understand why), and that branch predication can be used to solve the problem of diverging threads within the same block. However, the guide states that there is a limit to how often branch predication will be used. Why?
I will be working with very large matrices and will still need to make sure that threads don’t write out of bounds; presumably an if statement will do the trick. But how can I make the compiler use branch predication beyond the threshold given in the manual?
I have also thought about padding the matrix “borders” so that out-of-bounds threads don’t actually do any damage; however, this wastes (precious) space, and if the matrix I am working on is a sub-matrix of a larger one, this method won’t work.

Shared memory is shared by the threads of a thread block (a thread block is executed on one multiprocessor).

In programming guide 2.3, page 73: each multiprocessor has on-chip memory of the four following types:

  1. One set of local 32-bit registers per processor

  2. A parallel data cache or shared memory that is shared by all scalar processor cores and is where the shared memory space resides

The number of blocks a multiprocessor can process at once – referred to as the number of active blocks per multiprocessor – depends on how many registers per thread and how much shared memory per block are required for a given kernel, since the multiprocessor’s registers and shared memory are split among all the threads of the active blocks.

Page 107: the __shared__ qualifier, optionally used together with __device__, declares a variable that:

  1. Resides in the shared memory space of a thread block,

  2. Has the lifetime of the block,

  3. Is only accessible from all the threads within the block.
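In other words, a __shared__ array is allocated once per thread block and is visible to all threads of that block; it is not allocated separately by every thread, so it behaves much like the static analogy in your question. A minimal sketch (the kernel name is purely illustrative, and the block is assumed to be launched with 256 threads):

[codebox]// Minimal sketch: one shared array per block versus one local variable per thread.
// Assumes a launch such as sharedDemo<<<grid, 256>>>(out).
__global__ void sharedDemo(float *out)
{
    __shared__ float buf[256];      // ONE 256-float array per block, not per thread

    float x = (float)threadIdx.x;   // private to each thread (normally lives in a register)

    buf[threadIdx.x] = x;           // every thread writes its own slot of the SAME array
    __syncthreads();                // wait until the whole block has written

    // read a neighbour's slot to show that all threads really see the same array
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[(threadIdx.x + 1) % 256];
}[/codebox]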

Question 1: since every thread executes that kernel, wouldn’t it fail to allocate the array at some stage since shared memory is very small?

I'll take the matrix multiplication example on pages 23~25 of programming guide 2.3 to explain.

[codebox]#define BLOCK_SIZE 16

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    ....
    // Shared memory used to store Asub and Bsub respectively
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
}

void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    ....
    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
}[/codebox]

I have a Tesla C1060, which has compute capability 1.3 and the following properties:

  1. The number of registers per multiprocessor is 16384

  2. The maximum number of active blocks per multiprocessor is 8;

  3. The maximum number of active warps per multiprocessor is 32

  4. The maximum number of active threads per multiprocessor is 1024

  5. The amount of shared memory available per multiprocessor is 16 KB

Remark: property 3 is redundant, since from property 4: 1024 threads / 32 (threads per warp) = 32 maximum active warps.

In the matrix multiplication example there are two float matrices As and Bs declared in shared memory, each with 16 x 16 = 256 float elements, so the total size = 256 x 4 (4 bytes per float) x 2 (two matrices) = 2 KB.

Number of threads per block = 16 x 16 x 1 = 256 (from the code “dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE)”).

“The maximum number of active threads per multiprocessor is 1024” implies: maximum number of active blocks = 1024 / 256 = 4; this number also satisfies property 2 (“The maximum number of active blocks per multiprocessor is 8”).

Hence one multiprocessor has 4 active blocks, each block needs 2 KB of shared memory, so the total required shared memory = 2 KB x 4 = 8 KB; this number satisfies property 5 (“The amount of shared memory available per multiprocessor is 16 KB”).
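If you want to reproduce this arithmetic programmatically, here is a rough host-side sketch of the same calculation (register pressure is ignored here, because the per-thread register count of the kernel is not given in the example):

[codebox]// Rough sketch of the occupancy arithmetic above, using the compute capability 1.3 limits.
#include <stdio.h>

int main(void)
{
    const int maxBlocksPerSM  = 8;           // property 2
    const int maxThreadsPerSM = 1024;        // property 4
    const int sharedPerSM     = 16 * 1024;   // property 5: 16 KB

    const int threadsPerBlock = 16 * 16;                   // dimBlock(16, 16)
    const int sharedPerBlock  = 2 * 16 * 16 * 4;           // As + Bs, 4 bytes per float = 2 KB

    int byThreads = maxThreadsPerSM / threadsPerBlock;     // 1024 / 256  = 4
    int byShared  = sharedPerSM / sharedPerBlock;          // 16384 / 2048 = 8

    // the smallest of the three limits decides the number of active blocks
    int activeBlocks = maxBlocksPerSM;
    if (byThreads < activeBlocks) activeBlocks = byThreads;
    if (byShared  < activeBlocks) activeBlocks = byShared;

    printf("active blocks per multiprocessor: %d\n", activeBlocks);          // prints 4
    printf("shared memory actually used: %d KB\n",
           activeBlocks * sharedPerBlock / 1024);                            // prints 8
    return 0;
}[/codebox]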

Question 2: On the subject of using flow control statements, the manual states that such constructs should be avoided (and I understand why), and that branch predication can be used to solve the problem of diverging threads within the same block.

In fact, divergence happens only inside a warp, not across a whole block:

"Any flow control instruction (if, switch, do, for, while) can significantly

impact the effective instruction throughput by causing threads of the same warp to

diverge, that is, to follow different execution paths. If this happens, the different

executions paths have to be serialized, increasing the total number of instructions

executed for this warp. When all the different execution paths have completed, the

threads converge back to the same execution path"
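So the key point is warp granularity: a branch only costs you serialization when threads of the same warp take different paths. A small sketch of the difference (kernel name and data are purely illustrative; the usual warp size of 32 is assumed):

[codebox]__global__ void divergenceDemo(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Diverges INSIDE every warp: odd and even lanes take different paths,
    // so the two paths are serialized for each warp.
    if (threadIdx.x % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Branches at warp granularity: all 32 threads of a warp take the same
    // path, so there is no serialization even though the block as a whole diverges.
    if ((threadIdx.x / 32) % 2 == 0)
        data[tid] -= 3.0f;
    else
        data[tid] *= 0.5f;
}[/codebox]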

Question 3: I will be working with very large matrices and will still need to make sure that threads don’t write out of bounds.

All you have to do is set boundary conditions. For example, suppose you want to access a matrix A with dimensions n1 x n2, say A(1:n1, 1:n2); then the kernel function may be

[codebox]__global__ void foo( float *A, int n1, int n2)
{
    // BLOCK_DIM is the tile size (e.g. 16), matching the block dimensions used at launch
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    unsigned int index_in; // index of A

    // (yIndex, xIndex) = (i-1, j-1)
    // require 1 <= i <= n1 and 1 <= j <= n2
    if ( (xIndex < n2) && (yIndex < n1) ){
        // index (i,j) of A(1:n1, 1:n2) is mapped to (i-1)*n2 + (j-1)
        index_in = yIndex * n2 + xIndex;
        // process A(i,j) = A[index_in]
    }
}[/codebox]
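A possible matching launch might look like the following (BLOCK_DIM is assumed to be 16 here): the ceiling division makes the grid large enough to cover all of A(1:n1, 1:n2), and the boundary test inside foo masks off the extra threads of the last, partially filled blocks, so no padding of the matrix is needed:

[codebox]// Possible matching launch; assumes the kernel foo from the codebox above is in scope.
#define BLOCK_DIM 16

void launch_foo(float *d_A, int n1, int n2)
{
    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    // ceiling division: enough blocks to cover every element of A(1:n1, 1:n2)
    dim3 dimGrid((n2 + BLOCK_DIM - 1) / BLOCK_DIM,    // x direction covers the n2 columns
                 (n1 + BLOCK_DIM - 1) / BLOCK_DIM);   // y direction covers the n1 rows
    foo<<<dimGrid, dimBlock>>>(d_A, n1, n2);
}[/codebox]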

Thank you LSChien, that was a very enlightening post.