32 byte coalesced access is faster than 128 byte coalesced access?

I always thought that a coalesced global memory read/write meant the whole warp accesses a contiguous 128-byte region (32 threads x a 4-byte type). Then I read that coalescing only matters in 32-byte chunks, so I wrote the code below to test this. The intent of each access pattern is described in a comment above it. Based on what I read, I expected 01 to be the fastest and 03 to be a close 2nd, but 03 is about double the speed of 01, with the following timings on a 2080 Ti:

Elapsed time in microseconds 03 : 2418 us
Elapsed time in microseconds 01 : 5320 us
Elapsed time in microseconds 03 : 2460 us
Elapsed time in microseconds 02 : 16832 us
Elapsed time in microseconds 01 : 6783 us
Elapsed time in microseconds 03 : 2712 us
Elapsed time in microseconds 02 : 18250 us
Elapsed time in microseconds 01 : 5561 us

Any idea why this access pattern would be this much faster?

#include <chrono>
#include <iostream>

using namespace std;

#define checkCudaErrors( call ) \
  do { \
    const cudaError_t err = (call); \
    if( err != cudaSuccess ) { \
      std::cerr << "ERROR: " << cudaGetErrorString( err ) << std::endl; \
      exit( -1 ); \
    } \
  } while (0)

const int numStrides = 256;
const int numThreads = 256;

// Traditional access, whole warp/block coalesced read/write
__global__
void access01(float* d_mem) {
    for (int i = 0; i < numStrides; i++) {
        const int idx = threadIdx.x + blockDim.x * i;
        float v = d_mem[idx];
        d_mem[idx] = v + 1;
    }
}

// Bad access pattern: thread 0 reads/writes element 0, thread 1 element 256, etc.
__global__
void access02(float* d_mem) {
    for (int i = 0; i < numStrides; i++) {
        const int idx = threadIdx.x * numStrides + i;
        float v = d_mem[idx];
        d_mem[idx] = v + 1;
    }
}

// Tuned access: I read that access only needs to be coalesced within 32-byte chunks.
// In this pattern the first 8 threads read/write the same 32-byte chunk, then move on to the 32 bytes right next to it;
// the second 8 threads read/write starting where the first 8 threads will stop after the loop, and so on.
__global__
void access03(float* d_mem) {
    const int numFloatsChunked = 8;

    const int chunk = threadIdx.x / numFloatsChunked;
    const int tid = threadIdx.x % numFloatsChunked;

    for (int i = 0; i < numStrides; i++) {
        const int idx = tid + numFloatsChunked * numStrides * chunk + i*numFloatsChunked;
        float v = d_mem[idx];
        d_mem[idx] = v + 1;
    }
}

void runAccess01(float* d_mem) {
    // ACCESS 01
    //access01<<<1,numThreads>>>(d_mem);
    checkCudaErrors( cudaDeviceSynchronize() );

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++) {
        access01<<<1,numThreads>>>(d_mem);
        checkCudaErrors( cudaDeviceSynchronize() );
    }
    checkCudaErrors( cudaDeviceSynchronize() );
    auto end = std::chrono::high_resolution_clock::now();

    cout << "Elapsed time in microseconds 01 : "
    << chrono::duration_cast<chrono::microseconds>(end - start).count()
    << " us" << endl;
}

void runAccess02(float* d_mem) {
    // ACCESS 02
    //access02<<<1,numThreads>>>(d_mem);
    checkCudaErrors( cudaDeviceSynchronize() );

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++) {
        access02<<<1,numThreads>>>(d_mem);
        checkCudaErrors( cudaDeviceSynchronize() );
    }
    checkCudaErrors( cudaDeviceSynchronize() );
    auto end = std::chrono::high_resolution_clock::now();

    cout << "Elapsed time in microseconds 02 : "
    << chrono::duration_cast<chrono::microseconds>(end - start).count()
    << " us" << endl;
}

void runAccess03(float* d_mem) {
    // ACCESS 03
    //access03<<<1,numThreads>>>(d_mem);
    checkCudaErrors( cudaDeviceSynchronize() );

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; i++) {
        access03<<<1,numThreads>>>(d_mem);
        checkCudaErrors( cudaDeviceSynchronize() );
    }
    checkCudaErrors( cudaDeviceSynchronize() );
    auto end = std::chrono::high_resolution_clock::now();

    cout << "Elapsed time in microseconds 03 : "
    << chrono::duration_cast<chrono::microseconds>(end - start).count()
    << " us" << endl;
}

int main() {
    const int numElems = numStrides * numThreads;
    const int elemSize = numElems * sizeof(float);

    float* h_mem = (float *) malloc(elemSize);
    float* d_mem;
    checkCudaErrors( cudaMalloc((void **) &d_mem, elemSize) );

    for (int i = 0; i < numElems; i++) {
        h_mem[i] = 0;
    }

    checkCudaErrors( cudaMemcpy(d_mem, h_mem, elemSize, cudaMemcpyHostToDevice) );

    runAccess03(d_mem);
    runAccess01(d_mem);
    runAccess03(d_mem);
    runAccess02(d_mem);
    runAccess01(d_mem);
    runAccess03(d_mem);
    runAccess02(d_mem);
    runAccess01(d_mem);

    checkCudaErrors( cudaMemcpy(h_mem, d_mem, elemSize, cudaMemcpyDeviceToHost) );

    const float v = h_mem[0];
    for (int i = 0; i < numElems; i++) {
        if (v != h_mem[i]) {
            cout << "err" << endl;
            exit(1);
        }
    }
    cout << endl;
    cout << "all are " << v << endl;
}
  1. I would like to point out that <<<1,numThreads>>> is not an efficient way to use the GPU. Therefore I would hesitate to draw sweeping conclusions from this test.
  2. I suggest you study the SASS carefully. When I do that (CUDA 11.1, sm_75), I see substantial unrolling/reordering differences between the 01 and 03 cases: the 03 (i.e. faster) case is nicely unrolled and reordered, whereas the 01 (i.e. slower) case is unrolled but not reordered. This means that the 03 kernel can get more read requests in flight, more quickly. As you would expect, the 03 kernel reports noticeably higher register usage (33 vs. 20), due partly to the nice reordering.

Here’s an excerpt from the 01 kernel:

            Function : _Z8access01Pf
    .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
    /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;  /* 0x00000a00ff017624 */
                                                                        /* 0x000fd000078e00ff */
    /*0010*/                   S2R R3, SR_TID.X ;                       /* 0x0000000000037919 */
                                                                        /* 0x000e220000002100 */
    /*0020*/                   ULDC UR4, c[0x0][0x0] ;                  /* 0x0000000000047ab9 */
                                                                        /* 0x000fe40000000800 */
    /*0030*/                   UMOV UR6, 0xffffff00 ;                   /* 0xffffff0000067882 */
                                                                        /* 0x000fe40000000000 */
    /*0040*/                   USHF.L.U32 UR4, UR4, 0x2, URZ ;          /* 0x0000000204047899 */
                                                                        /* 0x000fd8000800063f */
    /*0050*/                   IMAD.MOV.U32 R0, RZ, RZ, 0x4 ;           /* 0x00000004ff007424 */
                                                                        /* 0x000fc800078e00ff */
    /*0060*/                   IMAD.WIDE R4, R3, R0, c[0x0][0x160] ;    /* 0x0000580003047625 */
                                                                        /* 0x003fd000078e0200 */
    /*0070*/                   LDG.E.SYS R2, [R4] ;                     /* 0x0000000004027381 */
                                                                        /* 0x000ea200001ee900 */
    /*0080*/                   USHF.R.S32.HI UR5, URZ, 0x1f, UR4 ;      /* 0x0000001f3f057899 */
                                                                        /* 0x000fe20008011404 */
    /*0090*/                   IADD3 R6, P0, R4, UR4, RZ ;              /* 0x0000000404067c10 */
                                                                        /* 0x000fca000ff1e0ff */
    /*00a0*/                   IADD3.X R7, R5, UR5, RZ, P0, !PT ;       /* 0x0000000505077c10 */
                                                                        /* 0x000fe200087fe4ff */
    /*00b0*/                   FADD R13, R2, 1 ;                        /* 0x3f800000020d7421 */
                                                                        /* 0x004fd00000000000 */
    /*00c0*/                   STG.E.SYS [R4], R13 ;                    /* 0x0000000d04007386 */
                                                                        /* 0x0001e8000010e900 */
    /*00d0*/                   LDG.E.SYS R2, [R6] ;                     /* 0x0000000006027381 */
                                                                        /* 0x000ea200001ee900 */
    /*00e0*/                   IADD3 R8, P0, R6, UR4, RZ ;              /* 0x0000000406087c10 */
                                                                        /* 0x000fc8000ff1e0ff */
    /*00f0*/                   IADD3.X R9, R7, UR5, RZ, P0, !PT ;       /* 0x0000000507097c10 */
                                                                        /* 0x000fe200087fe4ff */
    /*0100*/                   FADD R15, R2, 1 ;                        /* 0x3f800000020f7421 */
                                                                        /* 0x004fd00000000000 */
    /*0110*/                   STG.E.SYS [R6], R15 ;                    /* 0x0000000f06007386 */
                                                                        /* 0x0003e8000010e900 */
    /*0120*/                   LDG.E.SYS R2, [R8] ;                     /* 0x0000000008027381 */
                                                                        /* 0x000ea200001ee900 */
    /*0130*/                   IADD3 R10, P0, R8, UR4, RZ ;             /* 0x00000004080a7c10 */
                                                                        /* 0x000fc8000ff1e0ff */
    /*0140*/                   IADD3.X R11, R9, UR5, RZ, P0, !PT ;      /* 0x00000005090b7c10 */
                                                                        /* 0x000fe200087fe4ff */
    /*0150*/                   FADD R17, R2, 1 ;                        /* 0x3f80000002117421 */

Notice that the LDG and STG instructions basically alternate. Because the store is dependent on the load, the thread (i.e. warp) will stall at each FADD/STG instruction, waiting for the previous LDG.

Now let's look at the 03 kernel:

            Function : _Z8access03Pf
    .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
    /*0000*/                   MOV R1, c[0x0][0x28] ;                 /* 0x00000a0000017a02 */
                                                                      /* 0x000fd00000000f00 */
    /*0010*/                   S2R R3, SR_TID.X ;                     /* 0x0000000000037919 */
                                                                      /* 0x000e220000002100 */
    /*0020*/                   UMOV UR4, 0xffffff00 ;                 /* 0xffffff0000047882 */
                                                                      /* 0x000fe20000000000 */
    /*0030*/                   SHF.R.U32.HI R0, RZ, 0x3, R3 ;         /* 0x00000003ff007819 */
                                                                      /* 0x001fe40000011603 */
    /*0040*/                   LOP3.LUT R3, R3, 0x7, RZ, 0xc0, !PT ;  /* 0x0000000703037812 */
                                                                      /* 0x000fc800078ec0ff */
    /*0050*/                   LEA R0, R0, R3, 0xb ;                  /* 0x0000000300007211 */
                                                                      /* 0x000fd000078e58ff */
    /*0060*/                   MOV R3, 0x4 ;                          /* 0x0000000400037802 */
                                                                      /* 0x008fca0000000f00 */
    /*0070*/                   IMAD.WIDE R2, R0, R3, c[0x0][0x160] ;  /* 0x0000580000027625 */
                                                                      /* 0x000fd000078e0203 */
    /*0080*/                   LDG.E.SYS R4, [R2] ;                   /* 0x0000000002047381 */
                                                                      /* 0x000ea800001ee900 */
    /*0090*/                   LDG.E.SYS R6, [R2+0x20] ;              /* 0x0000200002067381 */
                                                                      /* 0x000ee800001ee900 */
    /*00a0*/                   LDG.E.SYS R8, [R2+0x40] ;              /* 0x0000400002087381 */
                                                                      /* 0x000f2800001ee900 */
    /*00b0*/                   LDG.E.SYS R10, [R2+0x60] ;             /* 0x00006000020a7381 */
                                                                      /* 0x000f6800001ee900 */
    /*00c0*/                   LDG.E.SYS R12, [R2+0x80] ;             /* 0x00008000020c7381 */
                                                                      /* 0x000f6800001ee900 */
    /*00d0*/                   LDG.E.SYS R14, [R2+0xa0] ;             /* 0x0000a000020e7381 */
                                                                      /* 0x000f6800001ee900 */
    /*00e0*/                   LDG.E.SYS R16, [R2+0xc0] ;             /* 0x0000c00002107381 */
                                                                      /* 0x000f6800001ee900 */
    /*00f0*/                   LDG.E.SYS R18, [R2+0xe0] ;             /* 0x0000e00002127381 */
                                                                      /* 0x000f6800001ee900 */
    /*0100*/                   LDG.E.SYS R20, [R2+0x100] ;            /* 0x0001000002147381 */
                                                                      /* 0x000f6800001ee900 */
    /*0110*/                   LDG.E.SYS R22, [R2+0x120] ;            /* 0x0001200002167381 */
                                                                      /* 0x000f6800001ee900 */
    /*0120*/                   LDG.E.SYS R24, [R2+0x140] ;            /* 0x0001400002187381 */
                                                                      /* 0x000f6800001ee900 */
    /*0130*/                   LDG.E.SYS R26, [R2+0x160] ;            /* 0x00016000021a7381 */
                                                                      /* 0x000f6800001ee900 */
    /*0140*/                   LDG.E.SYS R27, [R2+0x180] ;            /* 0x00018000021b7381 */
                                                                      /* 0x000f6800001ee900 */
    /*0150*/                   LDG.E.SYS R28, [R2+0x1a0] ;            /* 0x0001a000021c7381 */
                                                                      /* 0x000f6800001ee900 */

There are so many LDG instructions reordered by the compiler to the top of the code that we haven't even seen any STG instructions yet. Once we reach these LDG instructions, they can issue in a long burst with no stalls - that is the most efficient way to feed the LSUs. This pattern is one of the key reasons the compiler seeks to aggressively unroll loops: to enable this kind of instruction reordering at the machine level.

It’s not obvious to me why the compiler does not reorder things in the 01 case. Possibly a subject for further analysis or a bug.

Thanks so much! I just hard-coded blockDim.x in 01, and the LDGs all got reordered; 01 now runs in about 2/3 the time of 03. When I wrote the question, I was wondering if the reason would be some hand-wavy way the GPU does paging related to caching or something.
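
For anyone curious, the change was tiny. A sketch of what I did (the `access01_fixed` name is just for this post; `numThreads` is the same 256 constant used to launch the block):

```cuda
// Replacing the runtime value blockDim.x with the compile-time constant
// numThreads. With the stride known at compile time, the compiler reorders
// the unrolled LDGs ahead of the dependent STGs, as in the 03 kernel.
__global__
void access01_fixed(float* d_mem) {
    for (int i = 0; i < numStrides; i++) {
        const int idx = threadIdx.x + numThreads * i; // was blockDim.x * i
        float v = d_mem[idx];
        d_mem[idx] = v + 1;
    }
}
```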

Hope you enjoyed your well deserved vacation!