Kernel with a for-loop over matrix columns results in worse bandwidth for longer ranges?

I’m optimizing a kernel that analyzes a row-major matrix across columns. The thread must retain state, therefore it can’t simply be a monolithic kernel. The natural thing to do is to have one thread per column. I wrote a kernel that merely copies data while traversing the matrix columns, to measure the upper bound of bandwidth.

I noticed that if each thread loops across the whole column, it get a lower bandwidth than having a shorter span, while splitting the column into separate blocks. What is the reason for that, if the number of threads is constant and memory access is theoretically coalesced?

I modified the code from this post (https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/ , An Efficient Matrix Transpose in CUDA C/C++) trying to understand what is going on. I always have 1024 threads, analyzing a 1024x1024 matrix. In one extreme each thread is moving across a whole column. If I split each column in two, with two threads per column, keeping the number of vertical blocks == 1 and doubling the number of horizontal blocks, the kernel runs faster, until I reach the point where each thread moves only across 32 values, and we need 32 blocks to cover the whole matrix.

The number of threads is constant, and it’s a simple copy kernel. Memory should be coalesced, if I didn’t make any mistakes. There’s no use of shared memory to cause conflicts. What is the reason for this change in bandwidth?

Should I always divide my workload in tiles somehow, and avoid threads that cover long ranges of memory in a for-loop? What would be the reason for that, if memory access by the warp is always linear?

Device : Quadro P1000
Matrix size: 1024 1024, Block size: 1024 1, Tile size: 1024 1024
dimGrid: 1 1 1. dimBlock: 1024 1 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               23.62
              copy_blocky               23.27
Matrix size: 1024 1024, Block size: 1024 2, Tile size: 1024 1024
dimGrid: 2 1 1. dimBlock: 512 2 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               39.53
              copy_blocky               38.22
Matrix size: 1024 1024, Block size: 1024 4, Tile size: 1024 1024
dimGrid: 4 1 1. dimBlock: 256 4 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               64.82
              copy_blocky               55.65
Matrix size: 1024 1024, Block size: 1024 8, Tile size: 1024 1024
dimGrid: 8 1 1. dimBlock: 128 8 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               75.89
              copy_blocky               80.02
Matrix size: 1024 1024, Block size: 1024 16, Tile size: 1024 1024
dimGrid: 16 1 1. dimBlock: 64 16 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               79.81
              copy_blocky               79.70
Matrix size: 1024 1024, Block size: 1024 32, Tile size: 1024 1024
dimGrid: 32 1 1. dimBlock: 32 32 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               75.72
              copy_blocky               75.57
Matrix size: 1024 1024, Block size: 1024 64, Tile size: 1024 1024
dimGrid: 64 1 1. dimBlock: 16 64 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               74.10
              copy_blocky               73.41
Matrix size: 1024 1024, Block size: 1024 128, Tile size: 1024 1024
dimGrid: 128 1 1. dimBlock: 8 128 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               53.39
              copy_blocky               49.87
Matrix size: 1024 1024, Block size: 1024 256, Tile size: 1024 1024
dimGrid: 256 1 1. dimBlock: 4 256 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               46.99
              copy_blocky               47.19
Matrix size: 1024 1024, Block size: 1024 512, Tile size: 1024 1024
dimGrid: 512 1 1. dimBlock: 2 512 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               28.67
              copy_blocky               27.38
Matrix size: 1024 1024, Block size: 1024 1024, Tile size: 1024 1024
dimGrid: 1024 1 1. dimBlock: 1 1024 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               14.94
              copy_blocky               14.37

The code:

#include <stdio.h>
#include <assert.h>

// Convenience function for checking CUDA runtime API results
// can be wrapped around any runtime API call. No-op in release builds.
inline
cudaError_t checkCuda(cudaError_t result) {
#if defined(DEBUG) || defined(_DEBUG)
    if (result != cudaSuccess) {
      fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
      assert(result == cudaSuccess);
    }
#endif
    return result;
}

const int TILE_DIM = 1024;
//const int BLOCK_ROWS = 4;
const int NUM_REPS = 100;

// Check errors and print GB/s
void postprocess(const float *ref, const float *res, int n, float ms) {
    bool passed = true;
    for (int i = 0; i < n; i++)
        if (res[i] != ref[i]) {
            printf("%d %f %f\n", i, res[i], ref[i]);
            printf("%25s\n", "*** FAILED ***");
            passed = false;
            break;
        }
    if (passed)
        printf("%20.2f\n", 2 * n * sizeof(float) * 1e-6 * NUM_REPS / ms);
}

// simple copy kernel
// Used as reference case representing best effective bandwidth.
__global__ void copy_scattered(float *odata, const float *idata, int BLOCK_ROWS) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * blockDim.x;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = idata[(y + j) * width + x];
}

// simple copy kernel
// Used as reference case representing best effective bandwidth.
__global__ void copy_blocky(float *odata, const float *idata, int BLOCK_ROWS) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y * (TILE_DIM / BLOCK_ROWS);
    int width = gridDim.x * blockDim.x;

    for (int j = 0; j < TILE_DIM / BLOCK_ROWS; j += 1)
        odata[(y + j) * width + x] = idata[(y + j) * width + x];
}

int main(int argc, char **argv) {
    const int nx = 1024;
    const int ny = 1024;
    const int mem_size = nx * ny * sizeof(float);

    int devId = 0;
    if (argc > 1) devId = atoi(argv[1]);

    cudaDeviceProp prop;
    checkCuda(cudaGetDeviceProperties(&prop, devId));
    printf("\nDevice : %s\n", prop.name);

    checkCuda(cudaSetDevice(devId));

    float *h_idata = (float *) malloc(mem_size);
    float *h_cdata = (float *) malloc(mem_size);
    float *h_tdata = (float *) malloc(mem_size);
    float *gold = (float *) malloc(mem_size);

    float *d_idata, *d_cdata, *d_tdata;
    checkCuda(cudaMalloc(&d_idata, mem_size));
    checkCuda(cudaMalloc(&d_cdata, mem_size));
    checkCuda(cudaMalloc(&d_tdata, mem_size));

    for (int BLOCK_ROWS = 1; BLOCK_ROWS <= 1024; BLOCK_ROWS <<= 1) {
        dim3 dimGrid(nx / (1024 / BLOCK_ROWS), ny / TILE_DIM, 1);
        dim3 dimBlock(1024 / BLOCK_ROWS, BLOCK_ROWS, 1);

        printf("Matrix size: %d %d, Block size: %d %d, Tile size: %d %d\n",
               nx, ny, TILE_DIM, BLOCK_ROWS, TILE_DIM, TILE_DIM);
        printf("dimGrid: %d %d %d. dimBlock: %d %d %d\n",
               dimGrid.x, dimGrid.y, dimGrid.z, dimBlock.x, dimBlock.y, dimBlock.z);

        // check parameters and calculate execution configuration
        if (nx % TILE_DIM || ny % TILE_DIM) {
            printf("nx and ny must be a multiple of TILE_DIM\n");
            goto error_exit;
        }

        if (TILE_DIM % BLOCK_ROWS) {
            printf("TILE_DIM must be a multiple of BLOCK_ROWS\n");
            goto error_exit;
        }

        // host
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                h_idata[j * nx + i] = j * nx + i;

        // correct result for error checking
        for (int j = 0; j < ny; j++)
            for (int i = 0; i < nx; i++)
                gold[j * nx + i] = h_idata[i * nx + j];

        // device
        checkCuda(cudaMemcpy(d_idata, h_idata, mem_size, cudaMemcpyHostToDevice));

        // events for timing
        cudaEvent_t startEvent, stopEvent;
        checkCuda(cudaEventCreate(&startEvent));
        checkCuda(cudaEventCreate(&stopEvent));
        float ms;

        // ------------
        // time kernels
        // ------------
        printf("%25s%25s\n", "Routine", "Bandwidth (GB/s)");

        // ----
        // copy scattered
        // ----
        printf("%25s", "copy_scattered");
        checkCuda(cudaMemset(d_cdata, 0, mem_size));
        // warm up
        copy_scattered<<<dimGrid, dimBlock>>>(d_cdata, d_idata, BLOCK_ROWS);
        checkCuda(cudaEventRecord(startEvent, 0));
        for (int i = 0; i < NUM_REPS; i++)
            copy_scattered<<<dimGrid, dimBlock>>>(d_cdata, d_idata, BLOCK_ROWS);
        checkCuda(cudaEventRecord(stopEvent, 0));
        checkCuda(cudaEventSynchronize(stopEvent));
        checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
        checkCuda(cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost));
        postprocess(h_idata, h_cdata, nx * ny, ms);
        
        // ----
        // copy blocky
        // ----
        printf("%25s", "copy_blocky");
        checkCuda(cudaMemset(d_cdata, 0, mem_size));
        // warm up
        copy_blocky<<<dimGrid, dimBlock>>>(d_cdata, d_idata, BLOCK_ROWS);
        checkCuda(cudaEventRecord(startEvent, 0));
        for (int i = 0; i < NUM_REPS; i++)
            copy_blocky<<<dimGrid, dimBlock>>>(d_cdata, d_idata, BLOCK_ROWS);
        checkCuda(cudaEventRecord(stopEvent, 0));
        checkCuda(cudaEventSynchronize(stopEvent));
        checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));
        checkCuda(cudaMemcpy(h_cdata, d_cdata, mem_size, cudaMemcpyDeviceToHost));
        postprocess(h_idata, h_cdata, nx * ny, ms);

        error_exit:
        // cleanup
        checkCuda(cudaEventDestroy(startEvent));
        checkCuda(cudaEventDestroy(stopEvent));
    }

    checkCuda(cudaFree(d_tdata));
    checkCuda(cudaFree(d_cdata));
    checkCuda(cudaFree(d_idata));
    free(h_idata);
    free(h_tdata);
    free(h_cdata);
    free(gold);

}

A Quadro P1000 has 5 SMs. (You can use deviceQuery to discover this.) Each SM can support up to 2048 threads, so the total thread-carrying capacity to saturate the device could be as high as ~10,000 threads. You’re seeing a trend of increasing bandwidth in the first 5 or so test results because you are increasing the total thread count of your kernel. 1024 threads in a single block is not enough to get full performance from any GPU.

This performance evaporates over the last 5 or so results because you are starting to break coalescing as you create blocks that have fewer than 32 threads in the x-dimension. Let’s take the last result as an example:

block dimensions:

thread indexing calculation, considering the first warp in block (0,0):

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y * (TILE_DIM / BLOCK_ROWS);
int width = gridDim.x * blockDim.x;
(y + j) * width + x

threadIdx.x is always zero (each block has only 1 thread “width” in the x-dimension)

for the first warp, threadIdx.y varies from 0…31
for block (0,0), blockIdx.x and blockIdx.y are zero.

Therefore x is always zero across this warp. y takes on values of 0…31 multiplied by (TILE_DIM / BLOCK_ROWS) which is 1024/1024, so y takes on values 0…31. So far, so good. But y is also multiplied by width. This means that the net calculation across this warp is:

(y + j) * width + x
((0..31) + j) * width + 0

If width is anything other than 1 (hint: it is), adjacent threads in this warp will not be accessing adjacent locations in memory. You have broken coalescing based on your block dimension choice. As soon as you go to less than 32 threads in block x-dimension for this kernel design, you are breaking coalescing, and your results cascade downwards in performance as you make the scenario worse.

BTW you can use a profiler to investigate your coalescing claim. For example if you used nvprof, you could ask for the metric gld_efficiency. By running your tests one at a time, it would be possible to see that this varies across your tests.

1 Like

Thanks for the very elucidative answer! I’m a newbie used to the Jetson Nano, with a single SM, and never thought of multiple blocks running in parallel before.

I’m still a bit surprised, though. Should I always expect that all the concurrent threads supported by a device need to be running in order to saturate the memory bandwidth in a simple copy kernel? I guess it makes sense, otherwise you’ll often suffer from starvation?..

The Jetson Nano apparently supports 2048 concurrent threads. Do I need to schedule all the 2 blocks of 1024 threads (maximum supported by the device) to achieve maximum memory bandwidth, as a rule of thumb?

Related question: does it matter if a whole array is consumed linearly, or can the access be pretty much random as long as we fetch (aligned) 128-byte chunks?

Thanks again!

I think that is a good rule of thumb. (In fact, I consider that a general CUDA optimization principle, for all CUDA codes.) Can you do it with less? Probably. In fact I think your results might suggest that (you might be getting approx full bandwidth with ~8000 threads. It doesn’t look like you are getting full bandwidth with ~4000 threads). Can you do it with 1024 threads if your device supports 10,000? My guess would be not.

I think this is asking the same question. I think it is probably a good idea. I have no idea if it is necessary, perhaps not. You might be able to do it with 1024 threads. Could you do it with 32 threads? My guess would be not. But you might surprise me.

Hard to answer. My general thinking is that you can pretty much access “scattered” aligned 128 byte chunks and generally get pretty good bandwidth. The DRAM operates with a granularity of 32 bytes per load AFAIK. There are some factors that may work against this, such as memory partition behavior, TLB issues for large data sets, and possibly others (e.g. prefetching, which I think is currently not really a thing on GPUs). YMMV.

Rather than asking lots of questions like this, my view is to write your own test cases to figure things out. When experiment lines up with intuition, great. When it doesn’t you then research to figure out why (then, maybe ask questions). I consider this to be a better form of learning. It’s how I learned this stuff, for the most part. YMMV. To each their own, etc. Just a suggestion.

Excellent, that was very helpful, thank you! I’m watching that video.

I agree it’s important to make experiments, that’s why I made sure I had some code in my post. I’m appending below the results of the same program in a Jetson Nano for completeness, it shows that the bandwidth saturates earlier, when I guess we reach the maximum number of threads.

The lesson I wish I had learned earlier is that reaching maximum bandwidth in an I/O bound kernel may require, and almost certainly does, scheduling not only a full warp or a full block, but even multiple blocks. When you write a monolithic kernel you just tend to do this naturally, and having a long for-loop may deny an opportunity of dividing the workload. I think I grok it now.

Device : NVIDIA Tegra X1
Matrix size: 1024 1024, Block size: 1024 1, Tile size: 1024 1024
dimGrid: 1 1 1. dimBlock: 1024 1 1
                  Routine         Bandwidth (GB/s)
           copy_scattered                3.87
              copy_blocky               10.05
Matrix size: 1024 1024, Block size: 1024 2, Tile size: 1024 1024
dimGrid: 2 1 1. dimBlock: 512 2 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               13.86
              copy_blocky               14.84
Matrix size: 1024 1024, Block size: 1024 4, Tile size: 1024 1024
dimGrid: 4 1 1. dimBlock: 256 4 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               14.48
              copy_blocky               14.52
Matrix size: 1024 1024, Block size: 1024 8, Tile size: 1024 1024
dimGrid: 8 1 1. dimBlock: 128 8 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               14.64
              copy_blocky               14.81
Matrix size: 1024 1024, Block size: 1024 16, Tile size: 1024 1024
dimGrid: 16 1 1. dimBlock: 64 16 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               15.06
              copy_blocky               14.98
Matrix size: 1024 1024, Block size: 1024 32, Tile size: 1024 1024
dimGrid: 32 1 1. dimBlock: 32 32 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               14.84
              copy_blocky               14.82
Matrix size: 1024 1024, Block size: 1024 64, Tile size: 1024 1024
dimGrid: 64 1 1. dimBlock: 16 64 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               14.21
              copy_blocky               11.55
Matrix size: 1024 1024, Block size: 1024 128, Tile size: 1024 1024
dimGrid: 128 1 1. dimBlock: 8 128 1
                  Routine         Bandwidth (GB/s)
           copy_scattered               10.91
              copy_blocky                8.53
Matrix size: 1024 1024, Block size: 1024 256, Tile size: 1024 1024
dimGrid: 256 1 1. dimBlock: 4 256 1
                  Routine         Bandwidth (GB/s)
           copy_scattered                8.45
              copy_blocky                6.62
Matrix size: 1024 1024, Block size: 1024 512, Tile size: 1024 1024
dimGrid: 512 1 1. dimBlock: 2 512 1
                  Routine         Bandwidth (GB/s)
           copy_scattered                4.50
              copy_blocky                3.81
Matrix size: 1024 1024, Block size: 1024 1024, Tile size: 1024 1024
dimGrid: 1024 1 1. dimBlock: 1 1024 1
                  Routine         Bandwidth (GB/s)
           copy_scattered                2.28
              copy_blocky                2.28