Strided vs non-strided access

Hello,

I am confused by my experimental results. I thought that coalesced access would be more efficient than non-coalesced access, but the results indicate it may not be.

I have a batch of 10,240 matrices, and for each one I perform a transposed triangular solve L^T x = r, where L is a lower triangular matrix. Its entries are assumed to be stored symmetrically in the upper triangular part as well. Therefore, when solving L^T x = r, we can either do row access on the lower part or column access on the upper part.
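To make the two options concrete, here is a minimal single-matrix host sketch of the back substitution (just an illustration, not the code I benchmark below; column-major storage with leading dimension lda is assumed, consistent with how the kernels index L):

// Reference (host) version: solve L^T x = r in place for one matrix;
// x overwrites r. Column-major storage: L(row, col) = L[lda*col + row].
void solve_LT_host(int n, const double *L, double *r, int lda)
{
    for (int j = n - 1; j >= 0; j--) {
        r[j] /= L[lda*j + j];              // (L^T)(j,j) = L(j,j)
        for (int i = 0; i < j; i++) {
            // (L^T)(i,j) = L(j,i): read along row j of the lower part ...
            r[i] -= L[lda*i + j] * r[j];
            // ... or, since the matrix is stored symmetrically, read down
            // column j of the upper part instead:
            // r[i] -= L[lda*j + i] * r[j];
        }
    }
}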

I have two different implementations: one with row access and the other with column access.

Here is the code:

__global__
void row_access_gmem(int n, double *L, double *r, int lda)
{
    int tx = threadIdx.x;
    int start_L = blockIdx.x * (lda * n);   // offset of this block's matrix
    int start_r = blockIdx.x * lda;         // offset of this block's right-hand side

    for (int j=n-1; j>=0; j--) {
        if (tx == 0) {
            r[start_r + j] /= L[start_L + lda*j + j];   // divide by the diagonal
        }
        __syncthreads();
        if (tx < j) {
            r[start_r + tx] -= L[start_L + lda*tx + j]*r[start_r + j];   // row access
        }
        __syncthreads();
    }
}

__global__
void col_access_gmem(int n, double *L, double *r, int lda)
{
    int tx = threadIdx.x;
    int start_L = blockIdx.x * (lda * n);   // offset of this block's matrix
    int start_r = blockIdx.x * lda;         // offset of this block's right-hand side

    for (int j=n-1; j>=0; j--) {
        if (tx == 0) {
            r[start_r + j] /= L[start_L + lda*j + j];   // divide by the diagonal
        }
        __syncthreads();
        if (tx < j) {
            r[start_r + tx] -= L[start_L + lda*j + tx]*r[start_r + j];   // column access
        }
        __syncthreads();
    }
}

float launch(int n, double *d_L, double *d_r, int lda, int batch_count, std::string mode) 
{
    cudaEvent_t start, stop;
    float time;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    if (mode == "col_access") {
        cudaEventRecord(start, 0);
        col_access_gmem<<<batch_count, 32>>>(n, d_L, d_r, lda);
        cudaEventRecord(stop, 0);
    } else {
        cudaEventRecord(start, 0);
        row_access_gmem<<<batch_count, 32>>>(n, d_L, d_r, lda);
        cudaEventRecord(stop, 0);
    }
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return time;
}

where batch_count is set to 10,240, lda is the leading dimension of L and is set to n, and d_L and d_r are a randomly generated matrix and vector, respectively…
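For completeness, a minimal host driver consistent with the above could look like the following (hypothetical sizes and random initialization, not the exact code I executed; the random data is not truly symmetric, which does not affect the timing; error checking omitted):

// Hypothetical minimal driver for the kernels and launch() above.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main()
{
    const int n = 16, lda = n, batch_count = 10240;

    std::vector<double> h_L((size_t)batch_count * lda * n);
    std::vector<double> h_r((size_t)batch_count * lda);
    for (auto &v : h_L) v = 1.0 + rand() / (double)RAND_MAX;  // keep diagonals nonzero
    for (auto &v : h_r) v = rand() / (double)RAND_MAX;

    double *d_L, *d_r;
    cudaMalloc(&d_L, h_L.size() * sizeof(double));
    cudaMalloc(&d_r, h_r.size() * sizeof(double));
    cudaMemcpy(d_L, h_L.data(), h_L.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_r, h_r.data(), h_r.size() * sizeof(double), cudaMemcpyHostToDevice);

    float t_col = launch(n, d_L, d_r, lda, batch_count, "col_access");
    float t_row = launch(n, d_L, d_r, lda, batch_count, "row_access");
    printf("n=%d  col_access: %g ms  row_access: %g ms\n", n, t_col, t_row);

    cudaFree(d_L);
    cudaFree(d_r);
    return 0;
}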

The only difference between these two functions is L[start_L + lda*tx + j] for row_access_gmem versus L[start_L + lda*j + tx] for col_access_gmem. I expected col_access_gmem to be faster than row_access_gmem, but experimental results on a V100 for n = 1,…,32 are confusing: overall, row_access_gmem seems faster. The values below represent the time in seconds to complete a batch of 10,240 transposed triangular solves, averaged over 100 such batch solves.

Could anyone explain why this makes sense?

n    column_access (s)    row_access (s)
1 2.524960e-05 2.532480e-05
2 2.577056e-05 2.583265e-05
3 2.755616e-05 2.706208e-05
4 2.785056e-05 2.736832e-05
5 2.865984e-05 2.811584e-05
6 2.964961e-05 2.864960e-05
7 3.121760e-05 2.997823e-05
8 3.363008e-05 3.197472e-05
9 5.550368e-05 4.464032e-05
10 6.134815e-05 4.499680e-05
11 6.838144e-05 5.607841e-05
12 7.438912e-05 5.069087e-05
13 8.225953e-05 6.868767e-05
14 8.873472e-05 6.751201e-05
15 9.642623e-05 8.411776e-05
16 1.024435e-04 7.574111e-05
17 1.142887e-04 9.816932e-05
18 1.210586e-04 9.462494e-05
19 1.289104e-04 1.159174e-04
20 1.337415e-04 1.004269e-04
21 1.432275e-04 1.307974e-04
22 1.473923e-04 1.245030e-04
23 1.539002e-04 1.508035e-04
24 1.605379e-04 1.286886e-04
25 1.710672e-04 1.667926e-04
26 1.927619e-04 1.626445e-04
27 1.888992e-04 1.928192e-04
28 1.888048e-04 1.674023e-04
29 2.021040e-04 2.151600e-04
30 2.063350e-04 2.050176e-04
31 2.273561e-04 2.521434e-04
32 2.248604e-04 2.015907e-04

Could you please provide the complete code?

Accesses are typically fastest when the threads within a warp access consecutive values.
That is the case when the index contains +tx (and not lda*tx).
Also, if-conditions that depend on the value of tx lead to divergent code, which should be avoided.
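Concretely, for the two kernels in question (assuming column-major storage and 8-byte doubles):

// Within one warp at a fixed j:
//   col_access_gmem:  L[start_L + lda*j + tx]  -> neighbouring threads differ by
//                                                 8 bytes: coalesced
//   row_access_gmem:  L[start_L + lda*tx + j]  -> neighbouring threads differ by
//                                                 lda*8 bytes: one line per thread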

Most of your accesses do not depend on tx at all.
The same value is read by many threads… Your code depends a lot on cache performance.

You have many __syncthreads(), which could slow down execution.
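Since each block is launched with only 32 threads (a single warp), one experiment worth trying is replacing the block-wide barrier with a warp-level one, which also orders memory accesses among the threads of the warp:

__syncwarp();   // instead of __syncthreads(); valid here only because blockDim.x == 32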

You have many double divisions, which, depending on the GPU and on compiler flags (-use_fast_math?), could slow down computation by a lot.

Those are only guidelines. In your complicated case I would use a profiler (NVIDIA Nsight Compute) to better understand the difference between the two functions.

-use_fast_math does not affect double-precision computation, only single-precision computation.

The reason this nvcc command line flag has existed from the very start of CUDA (note the non-standard naming!) was the concern that programmers used to NVIDIA’s Cg shader language would find CUDA’s computational performance lacking. By specifying -use_fast_math they could closely approximate the fast-but-sloppy handling of float computation they were used to from Cg. This is an example of several design decisions in the early CUDA universe that were driven by a desire to maximize the speed of adoption.

Here is a snapshot of Nsight Compute's Memory Workload Analysis for n=16. In the figure, the baseline is col_access_gmem. It seems that row_access_gmem has a more efficient memory access pattern than col_access_gmem, which seems contradictory to me.

Is this because row_access_gmem requests all 15 columns simultaneously in the first iteration of the for loop, with j=15? I guess all future references to matrix L for j=14,…,0 will then be served from the cache.

In contrast, col_access_gmem requests one column at a time for each j=15,…,0, and the threads access each column in a coalesced manner, although some threads are inactive because of the if (tx < j) condition. But I do not think col_access_gmem incurs more than one memory transaction per column.
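Working through the addresses for n = 16 (lda = 16, 8-byte doubles; the per-matrix base offsets are 128-byte aligned, since start_L is a multiple of 256 doubles and cudaMalloc returns aligned memory):

// row_access_gmem: over the whole j-loop, thread tx (tx < 15) reads
//   L[start_L + 16*tx + j] for j = 15 .. tx+1
//   -> 15-tx consecutive doubles inside one 128-byte span per thread; the lines
//      for all active threads are requested together at j=15, and later
//      iterations hit the cache.
//
// col_access_gmem: at iteration j, threads 0 .. j-1 read
//   L[start_L + 16*j + tx]
//   -> one coalesced request per iteration, but each iteration touches a line
//      that has not been referenced before, and these requests are serialized
//      by the dependency on r[start_r + j] and the __syncthreads() in between.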

But why is row_access_gmem more efficient?

I’ve edited my post to show the code that I executed. Here, d_L and d_r are a randomly generated n-by-n symmetric matrix and an n-dimensional vector, respectively.

Could you upload the .ncu-rep file? I don't have access to a V100 GPU and cannot reproduce the results. I tested on a 3080 GPU and found that, when using double data types for the matrix, the execution time of the two kernels is essentially the same. However, the double-precision throughput on the 3080 is inherently low, so memory access performance doesn't significantly impact overall performance. After switching to float data types and retesting, I found that, as your experimental results show, col_access_gmem does indeed perform worse than row_access_gmem. There are several reasons for this phenomenon:

  1. The memory access and computation pattern in the kernel is: load → compute using the loaded data.
    This causes each computation to wait for the data to be loaded from global memory first. Generally, the compiler has optimization techniques to avoid pipeline stalls. For instance, it can change the instruction execution order by inserting other computation instructions between dependent memory access and compute instructions, or it can use loop unrolling to execute multiple memory access instructions in succession, followed by multiple computation instructions. However, in your program, the compiler cannot perform such optimizations. As a result, each thread must wait a long time after executing a data load instruction before it can execute the next instruction.
  2. Multiple warps running in parallel are not sufficient to hide the memory access latency.
  3. Caching mitigates the pipeline-stall issue in the row_access_gmem kernel.
    In the row_access_gmem kernel, while the data accessed by a warp is scattered, the data accessed by each individual thread is sequential. This means that on a cache miss, the currently needed data and the subsequently needed data are loaded into the cache together, which significantly speeds up later accesses and effectively reduces pipeline stalls.
    The main advantage of coalesced memory access is avoiding bandwidth waste. For example, data moves from global memory to the L2 cache in 128-byte chunks; if a non-coalesced access only uses 4 bytes of such a chunk, the effective bandwidth utilization is only 1/32. If your program needs high bandwidth rather than low memory access latency, coalesced access can greatly improve performance. From the analysis above, however, the main issue in your program is the lack of enough independent instructions between memory accesses and computations to hide pipeline stalls, so non-coalesced access can come out ahead in this case. A sketch of one possible variant that sidesteps the issue follows below.
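For example, here is a hedged sketch (untested on my side; the kernel name and the dynamic shared-memory launch are mine) that first stages each matrix in shared memory with coalesced loads and then runs the solve out of shared memory, so the dependent load→compute chain no longer waits on global memory:

__global__
void col_access_smem(int n, double *L, double *r, int lda)
{
    extern __shared__ double sL[];          // n*n doubles per block
    int tx = threadIdx.x;
    int start_L = blockIdx.x * (lda * n);
    int start_r = blockIdx.x * lda;

    // Stage the matrix with coalesced loads: thread tx copies element tx of column j.
    for (int j = 0; j < n; j++) {
        if (tx < n) {
            sL[j*n + tx] = L[start_L + lda*j + tx];
        }
    }
    __syncthreads();

    for (int j = n-1; j >= 0; j--) {
        if (tx == 0) {
            r[start_r + j] /= sL[j*n + j];
        }
        __syncthreads();
        if (tx < j) {
            r[start_r + tx] -= sL[j*n + tx] * r[start_r + j];
        }
        __syncthreads();
    }
}

It would be launched as col_access_smem<<<batch_count, 32, n*n*sizeof(double)>>>(n, d_L, d_r, lda); for n ≤ 32 this needs at most 8 KB of shared memory per block.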