Hello. I’m writing a 7-points 3D stencil kernel.
My current approach is each block of size 32 x 16 x 1
calculates a sub-matrix of size 30 x 14 x HEIGHT
(x, y, z dimension respectively). Each block calculates level z
, then move on to level z + 1
, and so on.
#define TILE_X 32
#define TILE_Y 16
#define IN_TILE_X 30
#define IN_TILE_Y 14
#define HEIGHT 32
#define PRELOAD 3
#define PRELOAD_COUNT (1 << PRELOAD)
__global__ void stencil(float *input, float *output, float *weight, int N) {
int x = blockIdx.x * IN_TILE_X + threadIdx.x - 1;
int y = blockIdx.y * IN_TILE_Y + threadIdx.y - 1;
int z = blockIdx.z * HEIGHT;
float prev = 0, next = 0, nxt[1 << PRELOAD] = {0};
__shared__ float current[TILE_Y][TILE_X];
__shared__ float weights[7];
int index = z * N * N + y * N + x;
int N_squared = N * N;
if (y >= 0 && z >= 0 && x >= 0 && y < N && x < N) {
current[threadIdx.y][threadIdx.x] = input[index];
if (z + 1 < N) {
next = input[index + N_squared];
}
if (z - 1 >= 0) {
prev = input[index - N_squared];
}
} else {
current[threadIdx.y][threadIdx.x] = 0;
}
if (threadIdx.x < 7 && threadIdx.y == 0) {
weights[threadIdx.x] = weight[threadIdx.x];
}
#pragma unroll 1
for (int i = 0; (i < HEIGHT && z < N); i++) {
if ((i & (PRELOAD_COUNT - 1)) == 0) {
if (y < N && x < N && x >= 0 && y >= 0)
for (int j = 0; j < 1 << PRELOAD; j++) {
if (z + j + 2 < N) {
nxt[j] = input[index + N_squared * (2 + j)];
} else {
nxt[j] = 0;
}
}
}
__syncthreads();
if (threadIdx.y > 0 && threadIdx.x > 0 && threadIdx.y < TILE_Y - 1 && threadIdx.x < TILE_X - 1 && y < N && x < N) {
float x0 = fmaf(
weights[4],
current[threadIdx.y][threadIdx.x + 1],
fmaf(weights[2], current[threadIdx.y + 1][threadIdx.x], fmaf(weights[0], current[threadIdx.y][threadIdx.x], 0))
);
float x1 = fmaf(
weights[5],
prev,
fmaf(weights[3], current[threadIdx.y][threadIdx.x - 1], fmaf(weights[1], current[threadIdx.y - 1][threadIdx.x], 0))
);
output[index] = fmaf(weights[6], next, x0 + x1);
}
__syncthreads();
prev = current[threadIdx.y][threadIdx.x];
current[threadIdx.y][threadIdx.x] = next;
index += N_squared;
z++;
next = nxt[i & (PRELOAD_COUNT- 1)];
}
}
I’m benchmarking the performance of this kernel with a matrix of size 768 x 768 x 768
. The best performing value for HEIGHT
is around 32. Values smaller than 32 perform worse due to reduced arithmetic intensity, which makes sense. But what I don’t understand is once HEIGHT
get past 32, the performance consistently degrades. For example, average runtime on my device (RTX 3060):
HEIGHT = 32
: 12.8msHEIGHT = 128
: 16.2msHEIGHT = 256
: 17.8msHEIGHT = 768
: 18.2ms
Profiling this kernel with ncu
revealed that:
- The kernel achieve 100% occupancy across all values of
HEIGHT
. - Higher
HEIGHT
also caused more stalls, ~31.5 cycles/inst whenHEIGHT = 768
vs. ~20.5 cycles/inst whenHEIGHT = 32
. Also atHEIGHT = 768
ncu warns that each warp spent around 10 cycles waiting for other warps at the barrier. - Warp Stall Sampling figures are roughly the same.
From my understanding, another ready warp will take place of a stalling warp, and too small grid will hinder parallelism.
My question why does raising HEIGHT
cause performance degradation, even with 100% occupancy and relatively big grid. Thank you!!
Also here is the full code I used to benchmark the kernel.
stencil_test.txt (4.6 KB)