Hi,
I am in the process of porting some C code to CUDA. I have read the programming guide, but I have a couple of questions before I start:
1 - In the guide’s example on using shared memory to accelerate a matrix multiplication kernel, a two-dimensional shared memory array is allocated with the __shared__ keyword. But since every thread executes that kernel, wouldn’t the allocation fail at some stage, given how small shared memory is? Or does an array declared as __shared__ behave like a C array declared as static?
2 - On the subject of flow control statements, the manual states that such constructs should be avoided (and I understand why), and that branch predication can be used to solve the problem of diverging threads within the same block. However, the guide states that there is a limit to how many instructions branch predication will be applied to. Why?
I will be working with very large matrices and will still need to make sure that threads don’t write out of bounds; presumably an if statement will do the trick. But how can I make the compiler use branch predication beyond the threshold given in the manual?
I have also thought about padding the matrix “borders” so that out-of-bounds threads don’t actually do any damage; however, this wastes (precious) space, and if the matrix I am working on is a sub-matrix of a larger one, this method won’t work.
I have a Tesla C1060, which has compute capability 1.3 and the following properties:
1. The number of registers per multiprocessor is 16384.
2. The maximum number of active blocks per multiprocessor is 8.
3. The maximum number of active warps per multiprocessor is 32.
4. The maximum number of active threads per multiprocessor is 1024.
5. The amount of shared memory available per multiprocessor is 16 KB.
Remark: property 3 is redundant, since from property 4,
1024 / 32 (32 threads per warp) = 32 (maximum active warps).
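As an aside, most of these figures can be checked at runtime with cudaGetDeviceProperties; a minimal sketch follows (the per-multiprocessor block and warp limits are not exposed by this struct and come from the guide’s tables):
[codebox]#include <cstdio>
#include <cuda_runtime.h>

int main( void )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 ); // query device 0

    // On compute capability 1.x, regsPerBlock and sharedMemPerBlock
    // match the per-multiprocessor figures listed above.
    printf( "%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor );
    printf( "registers per block     : %d\n", prop.regsPerBlock );
    printf( "shared memory per block : %u bytes\n", (unsigned)prop.sharedMemPerBlock );
    printf( "warp size               : %d\n", prop.warpSize );
    printf( "max threads per block   : %d\n", prop.maxThreadsPerBlock );
    return 0;
}[/codebox]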
Question 1: In the matrix multiplication example, there are two float matrices, As and Bs,
declared in shared memory, each with 16 x 16 = 256 float elements.
So the total size = 256 x 4 (4 bytes per float) x 2 (two matrices) = 2 KB.
The number of threads per block = 16 x 16 x 1 = 256 (from the code “dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE)”).
“The maximum number of active threads per multiprocessor is 1024” implies
the maximum number of active blocks = 1024 / 256 = 4;
this number also satisfies property 2 (“The maximum number of active blocks per multiprocessor is 8”).
Hence one multiprocessor has 4 active blocks, and each block needs 2 KB of shared memory, so
the total size of required shared memory = 2 KB x 4 = 8 KB;
this number satisfies property 5 (“The amount of shared memory available per multiprocessor is 16 KB”).
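This also answers question 1: an array declared with __shared__ is allocated once per block, not once per thread, so each resident block in the example consumes just 2 KB no matter how many threads run it. A minimal sketch of the idea (hypothetical kernel name, not the guide’s exact code):
[codebox]#define BLOCK_SIZE 16

__global__ void tileDemo( const float *A, float *B, int n )
{
    // Allocated once per block (1 KB here), visible to all 256
    // threads of that block -- not one copy per thread.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int y = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    bool inside = (x < n) && (y < n);

    // Cooperative load: each thread fills one element of the tile.
    As[threadIdx.y][threadIdx.x] = inside ? A[y * n + x] : 0.0f;
    __syncthreads(); // every thread of the block reaches this barrier

    if (inside)
        B[y * n + x] = As[threadIdx.y][threadIdx.x];
}[/codebox]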
Question 2: On the subject of flow control statements, the manual states that such constructs
should be avoided (and I understand why), and that branch predication can be used to solve the problem of
diverging threads within the same block.
In fact, divergence happens only inside a warp, not across a whole block:
"Any flow control instruction (if, switch, do, for, while) can significantly
impact the effective instruction throughput by causing threads of the same warp to
diverge, that is, to follow different execution paths. If this happens, the different
executions paths have to be serialized, increasing the total number of instructions
executed for this warp. When all the different execution paths have completed, the
threads converge back to the same execution path"
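To make the warp-level point concrete, here is a small sketch (hypothetical kernel name; warp size is 32 on this hardware, and blockDim.x is assumed to be a multiple of 32):
[codebox]__global__ void branchDemo( float *out )
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Diverges: even and odd lanes of the SAME warp take different
    // paths, so the warp executes both paths one after the other.
    if (tid % 2 == 0)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;

    // Does not diverge: tid / 32 is constant across each warp
    // (all 32 lanes share it), so every warp follows a single path
    // even though different warps take different branches.
    if ((tid / 32) % 2 == 0)
        out[tid] += 10.0f;
}[/codebox]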
Question 3: I will be working with very large matrices and will still need to make sure
that threads don’t write out of bounds.
All you have to do is set boundary conditions:
for example, suppose you want to access a matrix A with dimensions n1 x n2,
say A(1:n1, 1:n2); then the kernel function may be
[codebox]__global__ void foo( float *A, int n1, int n2 )
{
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    unsigned int index_in; // linear index into A
    // (yIndex, xIndex) = (i-1, j-1)
    // require 1 <= i <= n1 and 1 <= j <= n2
    if ( (xIndex < n2) && (yIndex < n1) ){
        // index (i,j) of A(1:n1, 1:n2) is mapped to (i-1)*n2 + (j-1)
        index_in = yIndex * n2 + xIndex;
        A[index_in] = 0.0f; // illustrative write; out-of-range threads do nothing
    }
}[/codebox]
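A matching host-side launch (a sketch; d_A and BLOCK_DIM are assumed to be defined elsewhere) rounds the grid up with a ceiling division, so the whole n1 x n2 matrix is covered and the in-kernel guard keeps the leftover threads harmless:
[codebox]dim3 dimBlock( BLOCK_DIM, BLOCK_DIM );
dim3 dimGrid( (n2 + BLOCK_DIM - 1) / BLOCK_DIM,   // columns -> x
              (n1 + BLOCK_DIM - 1) / BLOCK_DIM ); // rows    -> y
foo<<<dimGrid, dimBlock>>>( d_A, n1, n2 );[/codebox]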