Hi Forum,
I came across an interesting phenomenon while writing double-buffer pipeline code: I implemented several code versions that are supposed to be equivalent, but after profiling I found that they compile to different asm code and result in different runtimes. Would you please help me understand why the compiler treats those implementations differently?
The code first loads something from global to shared memory and then does some computation:
LDG2S(As, Bs)
MATMUL(As, Bs)
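(For context: LDG2S copies the current 16-wide K-tile of A and B from global memory into the given shared buffers, and MATMUL accumulates the partial product from those buffers. Below is a simplified sketch of what they might expand to, with placeholder indexing and one element per thread; my real macros are more involved.)

// Hypothetical sketch only; A, B, K, N, row, col, k0 and acc are assumed
// to live in the enclosing kernel scope.
#define TILE 16

// LDG2S: copy one TILE-wide K-tile of A and B from global to shared memory
#define LDG2S(As_buf, Bs_buf) {                                           \
    (As_buf)[threadIdx.y][threadIdx.x] = A[row * K + k0 + threadIdx.x];   \
    (Bs_buf)[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col]; \
    k0 += TILE;                                                           \
}

// MATMUL: accumulate the partial product of the two shared tiles
#define MATMUL(As_buf, Bs_buf) {                                          \
    for (int kk = 0; kk < TILE; ++kk)                                     \
        acc += (As_buf)[threadIdx.y][kk] * (Bs_buf)[kk][threadIdx.x];     \
}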
The main idea is to use two sets of buffers (As[0], Bs[0] and As[1], Bs[1]) to hide the global-load latency behind the computation. My implementations are as follows:
Implementation A uses a register to indicate which shared-memory buffers to work on:
LDG2S(As[0], Bs[0])
unsigned int pipeline_indicator = 0;
for (int i_step = 0; i_step < K/16 - 1; ++i_step) {
    __syncthreads();
    LDG2S(As[1 - pipeline_indicator], Bs[1 - pipeline_indicator])
    MATMUL(As[pipeline_indicator], Bs[pipeline_indicator])
    pipeline_indicator = 1 - pipeline_indicator;
}
__syncthreads();
MATMUL(As[pipeline_indicator], Bs[pipeline_indicator])
__syncthreads();
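To make the question concrete, here is a self-contained toy version of scheme A under the same simplifying assumptions as the sketch above (16x16 tiles, one output element per thread, M, N, K multiples of 16; the kernel name is made up and this is not my real kernel, just a minimal stand-in):

// Toy scheme A; launch with dim3 grid(N/TILE, M/TILE), block(TILE, TILE).
#define TILE 16

__global__ void sgemm_pipeline_A(const float* A, const float* B, float* C,
                                 int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    int k0 = 0;

    // LDG2S(As[0], Bs[0]): prologue load into buffer 0
    As[0][threadIdx.y][threadIdx.x] = A[row * K + k0 + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
    k0 += TILE;

    unsigned int p = 0;  // pipeline_indicator
    for (int i_step = 0; i_step < K / TILE - 1; ++i_step) {
        __syncthreads();
        // LDG2S(As[1-p], Bs[1-p]): load the next tile into the idle buffer
        As[1 - p][threadIdx.y][threadIdx.x] = A[row * K + k0 + threadIdx.x];
        Bs[1 - p][threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        k0 += TILE;
        // MATMUL(As[p], Bs[p]): compute on the previously loaded buffer
        for (int kk = 0; kk < TILE; ++kk)
            acc += As[p][threadIdx.y][kk] * Bs[p][kk][threadIdx.x];
        p = 1 - p;
    }
    __syncthreads();
    // MATMUL(As[p], Bs[p]): epilogue compute on the last loaded tile
    for (int kk = 0; kk < TILE; ++kk)
        acc += As[p][threadIdx.y][kk] * Bs[p][kk][threadIdx.x];

    C[row * N + col] = acc;
}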
Implementation B is very similar to A; it just uses !pipeline_indicator instead of 1 - pipeline_indicator:
LDG2S(As[0], Bs[0])
unsigned int pipeline_indicator = 0;
for (int i_step = 0; i_step < K/16 - 1; ++i_step) {
    __syncthreads();
    LDG2S(As[!pipeline_indicator], Bs[!pipeline_indicator])
    MATMUL(As[pipeline_indicator], Bs[pipeline_indicator])
    pipeline_indicator = !pipeline_indicator;
}
__syncthreads();
MATMUL(As[pipeline_indicator], Bs[pipeline_indicator])
__syncthreads();
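(Side note: since pipeline_indicator only ever holds 0 or 1, the two flips compute exactly the same value, which is why I expected identical code generation. A third equivalent spelling would be XOR, though I did not measure that one:)

unsigned int pipeline_indicator = 0;
// all equivalent for pipeline_indicator in {0, 1}:
pipeline_indicator = 1 - pipeline_indicator;  // implementation A
pipeline_indicator = !pipeline_indicator;     // implementation B
pipeline_indicator ^= 1u;                     // XOR variant (not measured)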
And here comes implementation C, where I get rid of both the pipeline_indicator register and the K/16 - 1 trip count, since in my test case K = 512, so everything is known at compile time (see the trip-count arithmetic after the listing):
LDG2S(As[0], Bs[0])
__syncthreads();
LDG2S(As[1], Bs[1])
MATMUL(As[0], Bs[0])
__syncthreads();
#pragma unroll
for (int i_step = 0; i_step < 15; ++i_step) {
    LDG2S(As[0], Bs[0])
    MATMUL(As[1], Bs[1])
    __syncthreads();
    LDG2S(As[1], Bs[1])
    MATMUL(As[0], Bs[0])
    __syncthreads();
}
MATMUL(As[1], Bs[1])
__syncthreads();
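(For completeness, the trip count of 15 falls out of K = 512 as follows, assuming the 16-wide K-tiles used throughout:)

// Trip-count arithmetic behind implementation C.
constexpr int K = 512;
constexpr int TILE_K = 16;
constexpr int STAGES = K / TILE_K;       // 32 load/compute stages in total
// The code before the loop performs 2 loads and 1 compute, each loop
// iteration performs 2 loads and 2 computes, and the epilogue performs
// the final compute, so the loop must cover the remaining stages:
constexpr int TRIPS = (STAGES - 2) / 2;  // == 15
static_assert(TRIPS == 15, "loop body runs 15 times when K == 512");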
And implementation D is even more brute force: I unroll the loop by hand, simply copy-pasting the loop body 15 times:
LDG2S(As[0], Bs[0])
__syncthreads();
LDG2S(As[1], Bs[1])
MATMUL(As[0], Bs[0])
__syncthreads();
// 1
LDG2S(As[0], Bs[0])
MATMUL(As[1], Bs[1])
__syncthreads();
LDG2S(As[1], Bs[1])
MATMUL(As[0], Bs[0])
__syncthreads();
// 2
LDG2S(As[0], Bs[0])
MATMUL(As[1], Bs[1])
__syncthreads();
LDG2S(As[1], Bs[1])
MATMUL(As[0], Bs[0])
__syncthreads();
// 3, 4, 5, ... 15
......
MATMUL(As[1], Bs[1])
__syncthreads();
I initially expected implementations A, B, C, and D to have similar runtimes, since they all do the same thing, but the profiling results show that C and D are clearly better than B, and B is better than A. (As for C versus D: C is better in the non-tensor-core version, and D is better in the tensor-core version.)
I also read the asm (SASS) for each implementation, and it seems that C and D have fewer stalls around STS instructions than A and B, which I think explains the runtime observation. But I am still confused about why the compiler treats those implementations differently. Is there a recommended way to write a double-buffered pipeline?
Thank you so much!