Hello,

I am confused by my experimental results. I expected coalesced access to be more efficient than non-coalesced access, but the results suggest otherwise.

I have a batch of 10,240 matrices, each of which requires a transposed triangular solve L^T x = r, where L is a lower triangular matrix. Its entries are assumed to be stored symmetrically in the upper part as well. Therefore, when solving L^T x = r, we can either access rows of the lower part or columns of the upper part.

I have two different implementations: one with row access and the other with column access.

Here is the code:

```
#include <string>

__global__
void row_access_gmem(int n, double *L, double *r, int lda)
{
    int tx = threadIdx.x;
    int start_L = blockIdx.x * (lda * n);  // offset of this block's matrix
    int start_r = blockIdx.x * lda;        // offset of this block's vector
    for (int j = n - 1; j >= 0; j--) {
        if (tx == 0) {
            r[start_r + j] /= L[start_L + lda*j + j];
        }
        __syncthreads();
        if (tx < j) {
            // strided: consecutive threads read with stride lda (non-coalesced)
            r[start_r + tx] -= L[start_L + lda*tx + j] * r[start_r + j];
        }
        __syncthreads();
    }
}

__global__
void col_access_gmem(int n, double *L, double *r, int lda)
{
    int tx = threadIdx.x;
    int start_L = blockIdx.x * (lda * n);
    int start_r = blockIdx.x * lda;
    for (int j = n - 1; j >= 0; j--) {
        if (tx == 0) {
            r[start_r + j] /= L[start_L + lda*j + j];
        }
        __syncthreads();
        if (tx < j) {
            // contiguous: consecutive threads read consecutive elements (coalesced)
            r[start_r + tx] -= L[start_L + lda*j + tx] * r[start_r + j];
        }
        __syncthreads();
    }
}

float launch(int n, double *d_L, double *d_r, int lda, int batch_count, std::string mode)
{
    cudaEvent_t start, stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    if (mode == "col_access") {
        cudaEventRecord(start, 0);
        col_access_gmem<<<batch_count, 32>>>(n, d_L, d_r, lda);
        cudaEventRecord(stop, 0);
    } else {
        cudaEventRecord(start, 0);
        row_access_gmem<<<batch_count, 32>>>(n, d_L, d_r, lda);
        cudaEventRecord(stop, 0);
    }
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);  // note: reported in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return time;
}
```

where batch_count is set to 10,240, lda is the leading dimension of L set to n, and d_L and d_r hold the randomly generated matrices and vectors, respectively…

The only difference between these two kernels is the indexing: L[start_L + lda*tx + j] in row_access_gmem versus L[start_L + lda*j + tx] in col_access_gmem. I expected col_access to be faster than row_access, but experimental results on a V100 over n = 1, …, 32 are confusing: overall, row_access appears faster. The values below are times in seconds to complete a batch of 10,240 transposed triangular solves, averaged over 100 such batch solves.

Could anyone explain why this happens?

n | column_access | row_access
---|---|---
1 | 2.524960e-05 | 2.532480e-05
2 | 2.577056e-05 | 2.583265e-05
3 | 2.755616e-05 | 2.706208e-05
4 | 2.785056e-05 | 2.736832e-05
5 | 2.865984e-05 | 2.811584e-05
6 | 2.964961e-05 | 2.864960e-05
7 | 3.121760e-05 | 2.997823e-05
8 | 3.363008e-05 | 3.197472e-05
9 | 5.550368e-05 | 4.464032e-05
10 | 6.134815e-05 | 4.499680e-05
11 | 6.838144e-05 | 5.607841e-05
12 | 7.438912e-05 | 5.069087e-05
13 | 8.225953e-05 | 6.868767e-05
14 | 8.873472e-05 | 6.751201e-05
15 | 9.642623e-05 | 8.411776e-05
16 | 1.024435e-04 | 7.574111e-05
17 | 1.142887e-04 | 9.816932e-05
18 | 1.210586e-04 | 9.462494e-05
19 | 1.289104e-04 | 1.159174e-04
20 | 1.337415e-04 | 1.004269e-04
21 | 1.432275e-04 | 1.307974e-04
22 | 1.473923e-04 | 1.245030e-04
23 | 1.539002e-04 | 1.508035e-04
24 | 1.605379e-04 | 1.286886e-04
25 | 1.710672e-04 | 1.667926e-04
26 | 1.927619e-04 | 1.626445e-04
27 | 1.888992e-04 | 1.928192e-04
28 | 1.888048e-04 | 1.674023e-04
29 | 2.021040e-04 | 2.151600e-04
30 | 2.063350e-04 | 2.050176e-04
31 | 2.273561e-04 | 2.521434e-04
32 | 2.248604e-04 | 2.015907e-04