Couple of questions:

- From what I’ve read, shared memory access is supposed to be almost as fast as register access. Is this actually true?

If so: I’m sure most of you are familiar with the matMul algorithm from the programming guide/SDK, which I’m modifying for a special use…

Portion of the original:

```
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}

// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
```
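(For reference, AS/BS above are the SDK’s accessor macros; quoting them from memory, so double-check against matrixMul_kernel.cu:)

```
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]
```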

Which I’ve modified to the following (set1):

```
__shared__ float vals[BLOCK_SIZE][BLOCK_SIZE];
vals[ty][tx] = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        vals[ty][tx] += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}
```

I get a 2x worse runtime D:
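To isolate the change: the only difference from the original is where the accumulator lives, so the inner loops boil down to this (same identifiers as above, just distilled):

```
// Original: Csub lives in a register; each k-iteration does two
// shared-memory reads (AS, BS) plus a register add.
Csub += AS(ty, k) * BS(k, tx);

// set1: the accumulator itself lives in shared memory, so each
// k-iteration adds a shared-memory read AND write of vals[ty][tx]
// on top of the AS/BS reads.
vals[ty][tx] += AS(ty, k) * BS(k, tx);
```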

However, if I do this (set2):

```
float Csub = 0;
__shared__ float vals[BLOCK_SIZE][BLOCK_SIZE];
vals[ty][tx] = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}
vals[ty][tx] = Csub;
__syncthreads();

// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = vals[ty][tx]; // Csub;
```

I get the original runtime back.

So is a shared-memory read/write access really 2x slower than a register access? Or is something else getting in the way, like the accesses not coalescing (or hitting bank conflicts)?
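If anyone wants to poke at this in isolation, here’s the kind of microbenchmark I have in mind (an untested sketch; the kernel names, N_ITERS, and the 256-thread block size are arbitrary choices of mine):

```
#define N_ITERS 4096

// Accumulate through a register: a dependent multiply-add chain so the
// compiler can't fold the loop away (seed is a kernel argument).
__global__ void regAccum(float *out, float seed)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;                              // lives in a register
    for (int i = 0; i < N_ITERS; ++i)
        acc = acc * seed + 1.0f;
    out[gid] = acc;
}

// Same arithmetic, but the accumulator lives in shared memory, so every
// iteration is a shared-memory read-modify-write (one slot per thread,
// so there are no bank conflicts).
__global__ void smemAccum(float *out, float seed)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float acc[256];
    acc[threadIdx.x] = 0.0f;
    for (int i = 0; i < N_ITERS; ++i)
        acc[threadIdx.x] = acc[threadIdx.x] * seed + 1.0f;
    out[gid] = acc[threadIdx.x];
}

// Launch both as <<<numBlocks, 256>>> and compare kernel times; any gap
// should be down to the extra shared-memory traffic.
```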

- Now, after compiling the original/set2 with the max register count at 32, I get a cubin with 14 registers and 2084/3108 bytes of smem, which is expected. However, if I set the max register count to 8 (trying to maximize warp occupancy), I get a cubin with reg = 8 but unchanged smem. How does this work?
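For reference, this is roughly how I’m compiling and reading the resource usage (a sketch; `matrixMul.cu` is a placeholder filename):

```
# -Xptxas -v makes ptxas print per-kernel resource usage (registers, smem,
# and any lmem); --maxrregcount caps the registers available per thread.
nvcc -c matrixMul.cu -Xptxas -v                     # default allocation
nvcc -c matrixMul.cu -Xptxas -v --maxrregcount=8    # cap at 8 registers
```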