I’m writing a 1D convolution kernel. This is my initial code:
__global__ void conv1d(const float *A, const float *B, float *C, size_t N, size_t K) {
    const int tid = threadIdx.x;
    float res[COARSE_FACTOR] = {0};
    const int index = tid + blockIdx.x * BLOCK_SIZE * COARSE_FACTOR;
    const int COARSE_BLOCK_SIZE = BLOCK_SIZE * COARSE_FACTOR;
    // Double-width shared window over A; sB caches one tile of the kernel B.
    __shared__ float sA[COARSE_BLOCK_SIZE * 2], sB[COARSE_BLOCK_SIZE * 2];
    const int offset = -((int)K / 2);
    int a_index, b_index;
    for (int i = 0; i < (K + COARSE_BLOCK_SIZE - 1) / COARSE_BLOCK_SIZE; i++) {
        b_index = i * COARSE_BLOCK_SIZE + tid;
        a_index = index + offset + i * COARSE_BLOCK_SIZE;
        if (i == 0) {
            // First tile: load both halves of the shared A window.
            for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
                int base_index = j + tid;
                int pos0 = a_index + j;
                sA[base_index] = (pos0 >= 0 && pos0 < N) ? A[pos0] : 0;
            }
            for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
                int base_index = j + tid + COARSE_BLOCK_SIZE;
                int pos0 = a_index + j + COARSE_BLOCK_SIZE;
                sA[base_index] = (pos0 >= 0 && pos0 < N) ? A[pos0] : 0;
            }
        } else {
            // Later tiles: shift the upper half down, then refill the upper half.
            for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
                int base_index = tid + j;
                sA[base_index] = sA[base_index + COARSE_BLOCK_SIZE];
                int pos = a_index + COARSE_BLOCK_SIZE + j;
                sA[base_index + COARSE_BLOCK_SIZE] = (pos < N) ? A[pos] : 0;
            }
        }
        // Load one tile of the convolution kernel B.
        for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
            int pos = b_index + j;
            sB[tid + j] = (pos < K) ? B[pos] : 0;
        }
        __syncthreads();
        // Each thread accumulates COARSE_FACTOR results, strided by BLOCK_SIZE.
        for (int k = 0; k < COARSE_FACTOR; k++) {
            for (int j = 0; j < COARSE_BLOCK_SIZE; j++) {
                res[k] += sA[j + k * BLOCK_SIZE + tid] * sB[j];
            }
        }
        __syncthreads();
    }
    for (int j = 0; j < COARSE_BLOCK_SIZE; j += BLOCK_SIZE) {
        if (index + j < N)
            C[index + j] = res[j / BLOCK_SIZE];
    }
}
The code uses thread coarsening, where each thread calculates COARSE_FACTOR results with a stride of BLOCK_SIZE. I then modified the code so that each thread calculates COARSE_FACTOR consecutive results:
// New code for calculating results
for (int k = 0; k < COARSE_FACTOR; k++) {
    for (int j = 0; j < COARSE_BLOCK_SIZE; j++) {
        res[k] += sA[j + k + COARSE_FACTOR * tid] * sB[j];
    }
}
// New code for saving calculated results
for (int j = 0; j < COARSE_FACTOR; j++) {
    int index = COARSE_BLOCK_SIZE * blockIdx.x + tid * COARSE_FACTOR + j;
    if (index < N)
        C[index] = res[j];
}
To my surprise, the modified code runs around 50% faster on my device (RTX 3060). Profiling the two kernels above with ncu gives the following:
Initial code:

Registers Per Thread             register/thread     242
Block Limit Registers            block                 8
Block Limit Shared Mem           block                12
Theoretical Active Warps per SM  warp                  8
Theoretical Occupancy            %                 16.67
Achieved Occupancy               %                 16.14
Achieved Active Warps Per SM     warp               7.75
And the modified code:

Registers Per Thread             register/thread      40
Block Limit Registers            block                48
Block Limit Shared Mem           block                20
Theoretical Active Warps per SM  warp                 16
Theoretical Occupancy            %                 33.33
Achieved Occupancy               %                 31.22
Achieved Active Warps Per SM     warp              14.99
The modified code somehow reduces the register usage to 40 regs/thread, which also leads to increased occupancy. I also noticed that if I increase COARSE_FACTOR in the modified code past a certain threshold (10 on my device, which I think is dictated by the maximum shared memory per SM), the register usage jumps to 255 regs/thread.
Can someone explain this behavior, and point me to where I can read more about it? Thank you.