Strided vs non-strided access

Hello,

I got confused by my experimental results. I thought that coalesced access would be more efficient than non-coalesced access, but the results indicate it may not be.

I have a batch of 10,240 matrices, each of which performs a transposed triangular solve: L^T x = r, where L is a lower triangular matrix. Its entries are assumed to be stored symmetrically in the upper part as well. Therefore, when we solve L^T x = r, we can do so either by row access from the lower part or by column access from the upper part.

I have two different implementations: one with row access and the other with column access.

Here is the code:

__global__
void row_access_gmem(int n, double *L, double *r, int lda)
{
    int tx = threadIdx.x;
    int start_L = blockIdx.x * (lda * n);  // this block's matrix
    int start_r = blockIdx.x * lda;        // this block's right-hand side

    for (int j=n-1; j>=0; j--) {
        if (tx == 0) {
            r[start_r + j] /= L[start_L + lda*j + j];
        }
        __syncthreads();
        if (tx < j) {
            // row j of the lower part: stride-lda access across threads
            r[start_r + tx] -= L[start_L + lda*tx + j]*r[start_r + j];
        }
        __syncthreads();
    }
}
__global__
void col_access_gmem(int n, double *L, double *r, int lda)
{
    int tx = threadIdx.x;
    int start_L = blockIdx.x * (lda * n);  // this block's matrix
    int start_r = blockIdx.x * lda;        // this block's right-hand side

    for (int j=n-1; j>=0; j--) {
        if (tx == 0) {
            r[start_r + j] /= L[start_L + lda*j + j];
        }
        __syncthreads();
        if (tx < j) {
            // column j of the upper part: unit-stride access across threads
            r[start_r + tx] -= L[start_L + lda*j + tx]*r[start_r + j];
        }
        __syncthreads();
    }
}

float launch(int n, double *d_L, double *d_r, int lda, int batch_count, std::string mode) 
{
    cudaEvent_t start, stop;
    float time;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    if (mode == "col_access") {
        cudaEventRecord(start, 0);
        col_access_gmem<<<batch_count, 32>>>(n, d_L, d_r, lda);
        cudaEventRecord(stop, 0);
    } else {
        cudaEventRecord(start, 0);
        row_access_gmem<<<batch_count, 32>>>(n, d_L, d_r, lda);
        cudaEventRecord(stop, 0);
    }
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return time;
}

where batch_count is set to 10,240, lda is the leading dimension of L (set to n), and d_L and d_r are a randomly generated matrix and a vector, respectively…

The only difference between these two functions is L[start_L + lda*tx + j] in row_access versus L[start_L + lda*j + tx] in col_access. I expected col_access to be faster than row_access, but experimental results on a V100 over n=1,…,32 are confusing: overall, row_access seems faster. The values are times in seconds to complete a batch of 10,240 transposed triangular solves, averaged over 100 such batch solves.

Could anyone explain why this makes sense?

n column_access row_access
1 2.524960e-05 2.532480e-05
2 2.577056e-05 2.583265e-05
3 2.755616e-05 2.706208e-05
4 2.785056e-05 2.736832e-05
5 2.865984e-05 2.811584e-05
6 2.964961e-05 2.864960e-05
7 3.121760e-05 2.997823e-05
8 3.363008e-05 3.197472e-05
9 5.550368e-05 4.464032e-05
10 6.134815e-05 4.499680e-05
11 6.838144e-05 5.607841e-05
12 7.438912e-05 5.069087e-05
13 8.225953e-05 6.868767e-05
14 8.873472e-05 6.751201e-05
15 9.642623e-05 8.411776e-05
16 1.024435e-04 7.574111e-05
17 1.142887e-04 9.816932e-05
18 1.210586e-04 9.462494e-05
19 1.289104e-04 1.159174e-04
20 1.337415e-04 1.004269e-04
21 1.432275e-04 1.307974e-04
22 1.473923e-04 1.245030e-04
23 1.539002e-04 1.508035e-04
24 1.605379e-04 1.286886e-04
25 1.710672e-04 1.667926e-04
26 1.927619e-04 1.626445e-04
27 1.888992e-04 1.928192e-04
28 1.888048e-04 1.674023e-04
29 2.021040e-04 2.151600e-04
30 2.063350e-04 2.050176e-04
31 2.273561e-04 2.521434e-04
32 2.248604e-04 2.015907e-04

Could you please provide the complete code?

Accesses are typically fastest when the threads within a warp access consecutive values.
That is the case when the index contains + tx (and not lda*tx).
Also, if conditions that depend on the value of tx lead to divergent code, which should be avoided.

Most of your accesses do not depend on tx at all.
The same value is read by many threads… Your code depends a lot on cache performance.

You have many __syncthreads() calls, which can slow down execution.

You have many double-precision divisions, which, depending on the GPU and on compiler flags (-use_fast_math?), can slow down computation by a lot.

Those are only guidelines. In your complicated case I would use a profiler (NVIDIA Nsight Compute) to better understand the difference between the two functions.

-use_fast_math does not affect double-precision computation, only single-precision computation.

The reason this nvcc command line flag has existed from the very start of CUDA (note the non-standard naming!) was the concern that programmers used to NVIDIA’s Cg shader language would find CUDA’s computational performance lacking. By specifying -use_fast_math they could closely approximate the fast-but-sloppy handling of float computation they were used to from Cg. This is an example of several design decisions in the early CUDA universe that were driven by a desire to maximize the speed of adoption.

Here is a snapshot of Nsight Compute's Memory Workload Analysis for n=16. In the figure, the baseline is col_access_gmem. It seems that row_access_gmem has a more efficient memory access pattern than col_access_gmem, which is a bit contradictory to me.

Is this because row_access_gmem requests all 15 columns simultaneously in the first iteration of the for loop, with j=15? I guess all future references to matrix L for j=14,…,0 will then come from the cache.

In contrast, col_access_gmem requests one column at a time for each j=15,…,0, and threads access each column in a coalesced manner, although some threads are inactive because of the if (tx < j) condition. But I do not think col_access_gmem incurs more than one memory transaction per column.

But why is row_access_gmem more efficient?

I’ve edited my post to show the code that I executed. Here, d_L and d_r are a randomly generated n-by-n symmetric matrix and an n-dimensional vector, respectively.