Shared mem vs. registers

Couple of questions:

  1. So from what I’ve read, shared memory access is supposed to be almost as fast as register access. Is this true?

If so, then I’m sure most are familiar with the matMul algorithm from the manual/sdk, which I’m modifying for a special use…

portion of original:

[codebox]// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}

// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;[/codebox]

Which I’ve modified to the following (set1):

[codebox]__shared__ float vals[BLOCK_SIZE][BLOCK_SIZE];
vals[ty][tx] = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        vals[ty][tx] += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}
[/codebox]

I get a 2x worse runtime D:

However, if I do this (set2):

[codebox]float Csub = 0;

__shared__ float vals[BLOCK_SIZE][BLOCK_SIZE];
vals[ty][tx] = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}

vals[ty][tx] = Csub;
__syncthreads();

// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = vals[ty][tx]; // Csub;[/codebox]

I get the original runtime.

So apparently a read/write shared-memory access is 2x slower than a register? Or is it not coalescing for some reason?

  2. Now, after compiling the original/set2 with max registers at 32, I get a cubin with 14 registers and 2084/3108 bytes of smem, which is expected. However, if I set the max register count to 8 (trying to maximize warp occupancy), I get a cubin with reg = 8 but unchanged smem. How does this work?

With --maxrregcount=8 the excess registers are spilled to local memory, not to shared memory.
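You can see where they went in the ptxas statistics, for example (the file name matrixMul.cu is just a stand-in for whatever your source is called):

[codebox]# build with verbose ptxas output and the register cap
nvcc --ptxas-options=-v --maxrregcount=8 matrixMul.cu -o matrixMul

# ptxas then reports per-kernel resource usage along the lines of
#   Used 8 registers, <some> bytes lmem, <same as before> bytes smem, ...
# the lmem figure grows when you cap the registers, while smem stays put
[/codebox]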

The concept of coalescing does not apply to shared memory; instead, read up on memory bank conflicts (yes, shared memory can be slower than registers, up to 15-fold).
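For illustration, here is a minimal sketch (not your kernel, just an assumed 16x16 float tile) of a conflict-free access next to a worst-case one:

[codebox]#define TILE 16   // matches the 16 banks of 32-bit words on current hardware

// Minimal sketch: the same shared tile read two ways by a 16x16 block.
__global__ void bankDemo(float *out)
{
    __shared__ float tile[TILE][TILE];

    int tx = threadIdx.x;   // 0..15, varies within a half-warp
    int ty = threadIdx.y;   // fixed within a half-warp

    tile[ty][tx] = tx;      // fill the tile
    __syncthreads();

    // Row access: word offset ty*16 + tx, bank = tx, so the 16 threads of a
    // half-warp hit 16 different banks: no conflict.
    float good = tile[ty][tx];

    // Column access: word offset tx*16 + ty, bank = ty, so the whole
    // half-warp piles onto one bank: a 16-way conflict, serialized.
    float bad = tile[tx][ty];

    out[ty * TILE + tx] = good + bad;
}
[/codebox]

The usual fix for the column case is to pad the tile to [TILE][TILE + 1], so that consecutive rows start in different banks.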

Christian

Thanks!

Err, can’t believe I missed the lmem ._. Would it be correct to suppose that manually declaring certain variables in shared memory would be faster, since local memory is apparently as slow as global memory?
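Something like this is what I have in mind (just a sketch with made-up names, not my actual kernel):

[codebox]#define BLOCK_SIZE 16

// Sketch: give each thread a small scratch slice in shared memory instead of
// a per-thread array, which the compiler may push into local memory once the
// register budget runs out.
__global__ void scratchDemo(const float *in, float *out)
{
    // float tmp[4];   // per-thread version: candidate for lmem

    // hand-placed version: one 4-element slice per thread, laid out so that
    // a half-warp reading the same k stays conflict-free (bank = tx)
    __shared__ float tmp[4][BLOCK_SIZE][BLOCK_SIZE];

    int tx  = threadIdx.x;
    int ty  = threadIdx.y;
    int idx = (blockIdx.y * BLOCK_SIZE + ty) * gridDim.x * BLOCK_SIZE
            + blockIdx.x * BLOCK_SIZE + tx;

    for (int k = 0; k < 4; ++k)
        tmp[k][ty][tx] = in[idx] * (k + 1);

    float sum = 0;
    for (int k = 0; k < 4; ++k)
        sum += tmp[k][ty][tx];

    out[idx] = sum;
}
[/codebox]

Is that roughly the idea?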

As for the bank conflicts… are you saying up to 15 (16?) times slower because a single warp may have 16 conflicts at once, or is this in reference to some other factor? Does a register access take only 1 clock cycle for 32 bits, as opposed to 2 for shared memory?

I can’t seem to get CUT_BANK_CHECKER or cutilCheckBankAccess to output anything in emu mode at the moment, even when I’m simulating a conflict… but just looking at the code, vals[ty][tx] shouldn’t really have conflicts, since it accesses a 16x16 float array… or is it because it is doing vals[ty][tx] = AS(…)*BS(…) within one warp? Sorry if this is asking too much…

Hmm, the bank checker may need a cutilExit(argc, argv); to know when to output its results.
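That is, something like the SDK samples do at the end of main() (from memory, so double-check against your cutil headers):

[codebox]#include <cutil_inline.h>   // assuming the usual SDK cutil header

int main(int argc, char **argv)
{
    // ... set up, launch the kernel, copy results back ...

    // gives cutil (and the bank checker in emulation mode) a chance to
    // flush and print whatever it has collected before the program exits
    cutilExit(argc, argv);

    return 0;
}
[/codebox]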

Other than that, I refer you to the NVIDIA documentation on bank conflicts. It’s all about the banks accessed within each half-warp.
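For what it’s worth, you can work your inner loop out by hand (assuming 16 banks of 32-bit words and a 16x16 block, so a half-warp has ty fixed and tx = 0..15):

[codebox]#define BLOCK_SIZE 16

// Sketch of the matMul inner loop, annotated with the bank each access hits.
__global__ void innerLoopBanks(float *out)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float vals[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x;   // 0..15, varies within a half-warp
    int ty = threadIdx.y;   // fixed within a half-warp

    As[ty][tx] = tx; Bs[ty][tx] = ty; vals[ty][tx] = 0;
    __syncthreads();

    for (int k = 0; k < BLOCK_SIZE; ++k)
        // As[ty][k]:    the whole half-warp reads one and the same word,
        //               which is served as a broadcast: no conflict
        // Bs[k][tx]:    word offset k*16 + tx, bank = tx: 16 distinct banks
        // vals[ty][tx]: word offset ty*16 + tx, bank = tx: 16 distinct banks
        vals[ty][tx] += As[ty][k] * Bs[k][tx];

    out[ty * BLOCK_SIZE + tx] = vals[ty][tx];
}
[/codebox]

So you are right that vals[ty][tx] itself is conflict-free; the 15/16-fold figure is the worst case, where all 16 threads of a half-warp hit different words in the same bank and the access is serialized.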

Christian