Couple of questions:

- From what I’ve read, shared memory access is supposed to be almost as fast as register access. Is this actually true?

If so: I’m sure most of you are familiar with the matMul algorithm from the programming guide/SDK, which I’m modifying for a special use…

Portion of the original:

```
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
float Csub = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}

// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
```
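(For reference, AS/BS above are the SDK’s accessor macros; quoting them from memory, so double-check against matrixMul_kernel.cu:)

```
#define AS(i, j) As[i][j]
#define BS(i, j) Bs[i][j]
```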

Which I’ve modified to the following (set1):

```
__shared__ float vals[BLOCK_SIZE][BLOCK_SIZE];
vals[ty][tx] = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        vals[ty][tx] += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}
```

I get a 2x worse runtime D:
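To isolate the change: the only difference from the original is where the accumulator lives, so the inner loops boil down to this (same identifiers as above, just distilled):

```
// Original: Csub lives in a register; each k-iteration does two
// shared-memory reads (AS, BS) plus a register add.
Csub += AS(ty, k) * BS(k, tx);

// set1: the accumulator itself lives in shared memory, so each
// k-iteration adds a shared-memory read AND write of vals[ty][tx]
// on top of the AS/BS reads.
vals[ty][tx] += AS(ty, k) * BS(k, tx);
```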

However, if I do this (set2):

```
float Csub = 0;
__shared__ float vals[BLOCK_SIZE][BLOCK_SIZE];
vals[ty][tx] = 0;

// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {

    // Declaration of the shared memory array As used to
    // store the sub-matrix of A
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

    // Declaration of the shared memory array Bs used to
    // store the sub-matrix of B
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    // Load the matrices from device memory
    // to shared memory; each thread loads
    // one element of each matrix
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];

    // Synchronize to make sure the matrices are loaded
    __syncthreads();

    // Multiply the two matrices together;
    // each thread computes one element
    // of the block sub-matrix
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);

    // Synchronize to make sure that the preceding
    // computation is done before loading two new
    // sub-matrices of A and B in the next iteration
    __syncthreads();
}
vals[ty][tx] = Csub;
__syncthreads();

// Write the block sub-matrix to device memory;
// each thread writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = vals[ty][tx]; // Csub;
```

I get the original runtime back.

So is a shared-memory read/write access really 2x slower than a register access? Or is something else getting in the way, like the accesses not coalescing (or hitting bank conflicts)?
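If anyone wants to poke at this in isolation, here’s the kind of microbenchmark I have in mind (an untested sketch; the kernel names, N_ITERS, and the 256-thread block size are arbitrary choices of mine):

```
#define N_ITERS 4096

// Accumulate through a register: a dependent multiply-add chain so the
// compiler can't fold the loop away (seed is a kernel argument).
__global__ void regAccum(float *out, float seed)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;                              // lives in a register
    for (int i = 0; i < N_ITERS; ++i)
        acc = acc * seed + 1.0f;
    out[gid] = acc;
}

// Same arithmetic, but the accumulator lives in shared memory, so every
// iteration is a shared-memory read-modify-write (one slot per thread,
// so there are no bank conflicts).
__global__ void smemAccum(float *out, float seed)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float acc[256];
    acc[threadIdx.x] = 0.0f;
    for (int i = 0; i < N_ITERS; ++i)
        acc[threadIdx.x] = acc[threadIdx.x] * seed + 1.0f;
    out[gid] = acc[threadIdx.x];
}

// Launch both as <<<numBlocks, 256>>> and compare kernel times; any gap
// should be down to the extra shared-memory traffic.
```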

- Now, after compiling the original/set2 with the max register count at 32, I get a cubin with 14 registers and 2084/3108 bytes of smem, which is expected. However, if I set the max register count to 8 (trying to maximize warp occupancy), I get a cubin with reg = 8 but unchanged smem. How does this work?
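For reference, this is roughly how I’m compiling and reading the resource usage (a sketch; `matrixMul.cu` is a placeholder filename):

```
# -Xptxas -v makes ptxas print per-kernel resource usage (registers, smem,
# and any lmem); --maxrregcount caps the registers available per thread.
nvcc -c matrixMul.cu -Xptxas -v                     # default allocation
nvcc -c matrixMul.cu -Xptxas -v --maxrregcount=8    # cap at 8 registers
```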