Coalesced access to global memory for double4

Dear CUDA developers,

I’m a bit confused by the alignment for memory transactions. In my application, I use an array of double4. According to the type declaration, each element should align to 2x16byte. Further, according to the CUDA programming guide, the global memory is accessed in 32-, 64-, or 128-byte memory transactions which means that a single memory transaction can contain 4 double4 values. Thus, if each thread of a kernel accesses one element of the array (in a regular access pattern), I would assume that number of memory transactions is 1/4 the length of the array. However, the profiler shows that the number of transactions is equal to the length of the array. How can I get the memory access coalesced?

Are you on a Maxwell (sm_5x) GPU?

I’m pretty sure that Maxwell GMEM transactions are 32 bytes which would explain why you’re seeing a transaction per double4 element.

Are you profiling using the command-line nvprof or one of the GUI tools?

You can get the memory access coalesced if adjacent threads in a warp are requesting adjacent elements in the array. The coalescing rules are a bit more general than that, but that should be sufficient to observe good coalescing.

Coalescing has to do with access patterns generated by threads in a warp, for a given read or write instruction.

Alignment is a secondary issue.

Also, double4 is more than can be read by a single thread, in a single transaction. According to the programming guide:

[url]Programming Guide :: CUDA Toolkit Documentation

the maximum transaction size for a single thread is 16 bytes. Therefore double2 is the largest that can be read by a single thread, using a single read instruction.

If you do an aligned read of double2 per thread, where adjacent threads are requesting adjacent elements in the array, you should witness good coalescing.

The compiler will certainly break a read of double4 per thread into at least 2 transactions (maybe 4, not sure). These reads will not witness “perfect coalescing”, because of the inherent gaps when reading double4 as a sequence (lo-hi) of double2 quantities.

Thanks for your replies! I’m on a GK110 card on RHEL (the OS was not my choice). In order to figure out if accessing double4 slows down execution, I modified the vectorAdd example.

double4:

inline __host__ __device__ double4 operator+(double4 a, double4 b) {
	return make_double4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

__global__ void vectorAdd(const double4 *A, const double4 *B, double4 *C,
		int numElements) {
	int i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < numElements) {
		C[i] = A[i] + B[i];
	}
}

double:

__global__ void vectorAdd(const double *Ax, const double *Ay, const double *Az, const double *Aw,
		const double *Bx, const double *By, const double *Bz, const double *Bw,
		 double *Cx,  double *Cy,  double *Cz,  double *Cw, int numElements) {
	int i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < numElements) {
		Cx[i] = Ax[i] + Bx[i];
		Cy[i] = Ay[i] + By[i];
		Cz[i] = Az[i] + Bz[i];
		Cw[i] = Aw[i] + Bw[i];
	}
}

Although the profiler reports memory alignment and access pattern issues for the double4 version, the kernel execution times for both versions are almost identical.

caching effects will tend to mitigate access pattern issues, at this level of granularity

That is not too surprising. The version using ‘double’ uses 64-bit loads but should be perfectly coalesced. The version using ‘double4’ uses 128-bit loads but is imperfectly coalesced. Wider loads make for more effective buffering by the queues in the load/store units, which gains efficiency, coalescing issues then negatively impacts load performance, balancing the gain from wider load. That the two effects balance close to perfectly is likely an artifact of this particular code, and wouldn’t necessarily apply to all code context for which such an experiment is performed

Note that GPUs require data to be properly aligned or loads will return undefined data. The ‘double4’ type is declared to be aligned to 16 bytes, so a simple conversion of an 8-byte aligned ‘double’ pointer to a ‘double4’ pointer may result in incorrectly behaving code unless the 16-byte alignment of the pointer is checked, or guaranteed by construction.

I’m not sure that the implementation given for double4 operator + will break into double2 (128-bit) loads.

EDIT: Yes, it does.

It is definitely prudent to check this by inspecting the SASS. The vectorization of memory accesses is an optimization, and there have been instances in the past where the CUDA compiler temporarily “lost the magic sauce” to make it happen in particular contexts.

Yes, here’s the sass comparison between the two kernels. In additional to verifying 128-bit loads, it sheds light on what other factors might influence the performance (e.g. if there were extra integer pressure arising from some other code…)

$ cuobjdump -sass t900.o

Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_20
                Function : _Z9vectorAddPK7double4S1_PS_i
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];                      /* 0x2800440400005de4 */
        /*0008*/         S2R R0, SR_CTAID.X;                         /* 0x2c00000094001c04 */
        /*0010*/         S2R R2, SR_TID.X;                           /* 0x2c00000084009c04 */
        /*0018*/         IMAD R0, R0, c[0x0][0x8], R2;               /* 0x2004400020001ca3 */
        /*0020*/         ISETP.GE.AND P0, PT, R0, c[0x0][0x38], PT;  /* 0x1b0e4000e001dc23 */
        /*0028*/     @P0 BRA.U 0xb8;                                 /* 0x40000002200081e7 */
        /*0030*/    @!P0 MOV32I R3, 0x20;                            /* 0x180000008000e1e2 */
        /*0038*/    @!P0 IMAD R22.CC, R0, R3, c[0x0][0x20];          /* 0x200780008005a0a3 */
        /*0040*/    @!P0 IMAD.HI.X R23, R0, R3, c[0x0][0x24];        /* 0x208680009005e0e3 */
        /*0048*/    @!P0 IMAD R20.CC, R0, R3, c[0x0][0x28];          /* 0x20078000a00520a3 */
        /*0050*/    @!P0 LD.E.128 R16, [R22];                        /* 0x84000000016420c5 */
        /*0058*/    @!P0 IMAD.HI.X R21, R0, R3, c[0x0][0x2c];        /* 0x20868000b00560e3 */
        /*0060*/    @!P0 LD.E.128 R8, [R22+0x10];                    /* 0x84000000416220c5 */
        /*0068*/    @!P0 IMAD R2.CC, R0, R3, c[0x0][0x30];           /* 0x20078000c000a0a3 */
        /*0070*/    @!P0 LD.E.128 R12, [R20];                        /* 0x84000000014320c5 */
        /*0078*/    @!P0 LD.E.128 R4, [R20+0x10];                    /* 0x84000000414120c5 */
        /*0080*/    @!P0 IMAD.HI.X R3, R0, R3, c[0x0][0x34];         /* 0x20868000d000e0e3 */
        /*0088*/    @!P0 DADD R14, R18, R14;                         /* 0x480000003923a001 */
        /*0090*/    @!P0 DADD R12, R16, R12;                         /* 0x4800000031032001 */
        /*0098*/    @!P0 DADD R6, R10, R6;                           /* 0x4800000018a1a001 */
        /*00a0*/    @!P0 DADD R4, R8, R4;                            /* 0x4800000010812001 */
        /*00a8*/    @!P0 ST.E.128 [R2], R12;                         /* 0x94000000002320c5 */
        /*00b0*/    @!P0 ST.E.128 [R2+0x10], R4;                     /* 0x94000000402120c5 */
        /*00b8*/         EXIT;                                       /* 0x8000000000001de7 */
                ..............................................


                Function : _Z9vectorAddPKdS0_S0_S0_S0_S0_S0_S0_PdS1_S1_S1_i
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];                      /* 0x2800440400005de4 */
        /*0008*/         S2R R0, SR_CTAID.X;                         /* 0x2c00000094001c04 */
        /*0010*/         S2R R2, SR_TID.X;                           /* 0x2c00000084009c04 */
        /*0018*/         IMAD R0, R0, c[0x0][0x8], R2;               /* 0x2004400020001ca3 */
        /*0020*/         ISETP.GE.AND P0, PT, R0, c[0x0][0x80], PT;  /* 0x1b0e40020001dc23 */
        /*0028*/     @P0 BRA.U 0x178;                                /* 0x40000005200081e7 */
        /*0030*/    @!P0 MOV32I R12, 0x8;                            /* 0x18000000200321e2 */
        /*0038*/    @!P0 IMAD R4.CC, R0, R12, c[0x0][0x20];          /* 0x20198000800120a3 */
        /*0040*/    @!P0 IMAD.HI.X R5, R0, R12, c[0x0][0x24];        /* 0x20988000900160e3 */
        /*0048*/    @!P0 IMAD R14.CC, R0, R12, c[0x0][0x40];         /* 0x201980010003a0a3 */
        /*0050*/    @!P0 LD.E.64 R10, [R4];                          /* 0x840000000042a0a5 */
        /*0058*/    @!P0 IMAD.HI.X R15, R0, R12, c[0x0][0x44];       /* 0x209880011003e0e3 */
        /*0060*/    @!P0 IMAD R6.CC, R0, R12, c[0x0][0x60];          /* 0x201980018001a0a3 */
        /*0068*/    @!P0 LD.E.64 R8, [R14];                          /* 0x8400000000e220a5 */
        /*0070*/    @!P0 IMAD.HI.X R7, R0, R12, c[0x0][0x64];        /* 0x209880019001e0e3 */
        /*0078*/    @!P0 IMAD R2.CC, R0, R12, c[0x0][0x28];          /* 0x20198000a000a0a3 */
        /*0080*/    @!P0 IMAD.HI.X R3, R0, R12, c[0x0][0x2c];        /* 0x20988000b000e0e3 */
        /*0088*/    @!P0 IMAD R4.CC, R0, R12, c[0x0][0x48];          /* 0x20198001200120a3 */
        /*0090*/    @!P0 IMAD.HI.X R5, R0, R12, c[0x0][0x4c];        /* 0x20988001300160e3 */
        /*0098*/    @!P0 IMAD R16.CC, R0, R12, c[0x0][0x68];         /* 0x20198001a00420a3 */
        /*00a0*/    @!P0 IMAD.HI.X R17, R0, R12, c[0x0][0x6c];       /* 0x20988001b00460e3 */
        /*00a8*/    @!P0 IMAD R14.CC, R0, R12, c[0x0][0x30];         /* 0x20198000c003a0a3 */
        /*00b0*/    @!P0 IMAD.HI.X R15, R0, R12, c[0x0][0x34];       /* 0x20988000d003e0e3 */
        /*00b8*/    @!P0 DADD R18, R10, R8;                          /* 0x4800000020a4a001 */
        /*00c0*/    @!P0 ST.E.64 [R6], R18;                          /* 0x940000000064a0a5 */
        /*00c8*/    @!P0 LD.E.64 R8, [R2];                           /* 0x84000000002220a5 */
        /*00d0*/    @!P0 LD.E.64 R10, [R4];                          /* 0x840000000042a0a5 */
        /*00d8*/    @!P0 IMAD R2.CC, R0, R12, c[0x0][0x50];          /* 0x201980014000a0a3 */
        /*00e0*/    @!P0 IMAD.HI.X R3, R0, R12, c[0x0][0x54];        /* 0x209880015000e0e3 */
        /*00e8*/    @!P0 DADD R18, R8, R10;                          /* 0x480000002884a001 */
        /*00f0*/    @!P0 IMAD R10.CC, R0, R12, c[0x0][0x70];         /* 0x20198001c002a0a3 */
        /*00f8*/    @!P0 ST.E.64 [R16], R18;                         /* 0x940000000104a0a5 */
        /*0100*/    @!P0 LD.E.64 R4, [R14];                          /* 0x8400000000e120a5 */
        /*0108*/    @!P0 LD.E.64 R6, [R2];                           /* 0x840000000021a0a5 */
        /*0110*/    @!P0 IMAD.HI.X R11, R0, R12, c[0x0][0x74];       /* 0x20988001d002e0e3 */
        /*0118*/    @!P0 IMAD R8.CC, R0, R12, c[0x0][0x38];          /* 0x20198000e00220a3 */
        /*0120*/    @!P0 IMAD.HI.X R9, R0, R12, c[0x0][0x3c];        /* 0x20988000f00260e3 */
        /*0128*/    @!P0 IMAD R2.CC, R0, R12, c[0x0][0x58];          /* 0x201980016000a0a3 */
        /*0130*/    @!P0 IMAD.HI.X R3, R0, R12, c[0x0][0x5c];        /* 0x209880017000e0e3 */
        /*0138*/    @!P0 IMAD R16.CC, R0, R12, c[0x0][0x78];         /* 0x20198001e00420a3 */
        /*0140*/    @!P0 IMAD.HI.X R17, R0, R12, c[0x0][0x7c];       /* 0x20988001f00460e3 */
        /*0148*/    @!P0 DADD R14, R4, R6;                           /* 0x480000001843a001 */
        /*0150*/    @!P0 ST.E.64 [R10], R14;                         /* 0x9400000000a3a0a5 */
        /*0158*/    @!P0 LD.E.64 R4, [R8];                           /* 0x84000000008120a5 */
        /*0160*/    @!P0 LD.E.64 R6, [R2];                           /* 0x840000000021a0a5 */
        /*0168*/    @!P0 DADD R2, R4, R6;                            /* 0x480000001840a001 */
        /*0170*/    @!P0 ST.E.64 [R16], R2;                          /* 0x940000000100a0a5 */
        /*0178*/         EXIT;                                       /* 0x8000000000001de7 */
                .................................................................



Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$