Coalesced access to global memory for double4

stephan_h · September 8, 2015, 2:00pm

Dear CUDA developers,

I’m a bit confused by the alignment rules for memory transactions. In my application, I use an array of double4. According to the type declaration, each element is aligned to 16 bytes and occupies 2 x 16 bytes. Further, according to the CUDA programming guide, global memory is accessed in 32-, 64-, or 128-byte memory transactions, which means that a single 128-byte transaction can contain 4 double4 values. Thus, if each thread of a kernel accesses one element of the array (in a regular access pattern), I would expect the number of memory transactions to be 1/4 the length of the array. However, the profiler shows that the number of transactions is equal to the length of the array. How can I get the memory access coalesced?

allanmac · September 8, 2015, 4:00pm

Are you on a Maxwell (sm_5x) GPU?

I’m pretty sure that Maxwell GMEM transactions are 32 bytes which would explain why you’re seeing a transaction per double4 element.

Are you profiling using the command-line nvprof or one of the GUI tools?
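The arithmetic behind both the expected and the observed counts can be sketched on the host. This is only a back-of-the-envelope model that assumes a contiguous array starting on a transaction boundary; `transactions_for_array` is a hypothetical helper name, not a CUDA API, and real profiler counters may be defined differently.

```cpp
#include <cassert>
#include <cstddef>

// Transactions needed to cover `count` contiguous elements of `elem_bytes`
// each, assuming the array starts on a transaction boundary.
inline std::size_t transactions_for_array(std::size_t count,
                                          std::size_t elem_bytes,
                                          std::size_t txn_bytes) {
    assert(txn_bytes > 0);
    return (count * elem_bytes + txn_bytes - 1) / txn_bytes;  // ceiling division
}
```

For 1024 double4 elements (32 bytes each), `transactions_for_array(1024, 32, 128)` gives 256, the 1/4 ratio expected in the question, while `transactions_for_array(1024, 32, 32)` gives 1024, one transaction per element.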

Robert_Crovella · September 8, 2015, 4:01pm

You can get the memory access coalesced if adjacent threads in a warp are requesting adjacent elements in the array. The coalescing rules are a bit more general than that, but that should be sufficient to observe good coalescing.

Coalescing has to do with access patterns generated by threads in a warp, for a given read or write instruction.

Alignment is a secondary issue.

Also, double4 is more than a single thread can read with a single load instruction. According to the programming guide:

Programming Guide :: CUDA Toolkit Documentation

the largest access a single thread can make with one instruction is 16 bytes. Therefore double2 is the largest type that can be read by a single thread using a single read instruction.

If you do an aligned read of double2 per thread, where adjacent threads are requesting adjacent elements in the array, you should witness good coalescing.

The compiler will certainly break a read of double4 per thread into at least 2 load instructions (maybe 4, not sure). These reads will not witness “perfect coalescing”, because of the inherent gaps when reading double4 as a sequence (lo-hi) of double2 quantities.
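To make the “inherent gaps” concrete, here is a minimal host-side C++ sketch. It is a simplified model, not how the hardware or the profiler actually counts transactions: it assumes 128-byte aligned segments and a 32-thread warp, ignores caches, and `segments_touched` and `load_efficiency` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical model, not an NVIDIA API: count the distinct aligned segments
// of `segment_bytes` touched when each of `warp_size` threads issues one load
// of `access_bytes` at byte address `offset + lane * stride_bytes`.
inline std::size_t segments_touched(std::size_t stride_bytes,
                                    std::size_t access_bytes,
                                    std::size_t offset = 0,
                                    std::size_t segment_bytes = 128,
                                    std::size_t warp_size = 32) {
    assert(access_bytes > 0 && segment_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < warp_size; ++lane) {
        const std::size_t first = offset + lane * stride_bytes;
        const std::size_t last = first + access_bytes - 1;
        for (std::size_t s = first / segment_bytes; s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}

// Fraction of the fetched bytes that the warp actually requested.
inline double load_efficiency(std::size_t stride_bytes,
                              std::size_t access_bytes,
                              std::size_t segment_bytes = 128,
                              std::size_t warp_size = 32) {
    const std::size_t fetched =
        segments_touched(stride_bytes, access_bytes, 0, segment_bytes, warp_size) *
        segment_bytes;
    return static_cast<double>(warp_size * access_bytes) /
           static_cast<double>(fetched);
}
```

Under these assumptions, `segments_touched(8, 8)` (one double per thread) gives 2 segments and `segments_touched(16, 16)` (one double2 per thread) gives 4, both with an efficiency of 1.0. Each half of a double4 read, `segments_touched(32, 16)` for the low half or `segments_touched(32, 16, 16)` for the high half, touches 8 segments, and `load_efficiency(32, 16)` is 0.5. The two halves touch the same 8 segments, so a cache may be able to serve part of the second load.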

stephan_h · September 8, 2015, 4:33pm

Thanks for your replies! I’m on a GK110 card on RHEL (the OS was not my choice). In order to figure out if accessing double4 slows down execution, I modified the vectorAdd example.

double4:

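// Component-wise sum of two double4 values.
inline __host__ __device__ double4 operator+(double4 a, double4 b) {
	return make_double4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// Array-of-structures version: each thread loads one 32-byte double4 from A and B.
__global__ void vectorAdd(const double4 *A, const double4 *B, double4 *C,
		int numElements) {
	int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index

	if (i < numElements) {  // guard the partial last block
		C[i] = A[i] + B[i];
	}
}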

double:

// Structure-of-arrays version: one array per component, so adjacent threads
// read adjacent 8-byte doubles from each array.
__global__ void vectorAdd(const double *Ax, const double *Ay, const double *Az, const double *Aw,
		const double *Bx, const double *By, const double *Bz, const double *Bw,
		double *Cx, double *Cy, double *Cz, double *Cw, int numElements) {
	int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index

	if (i < numElements) {  // guard the partial last block
		Cx[i] = Ax[i] + Bx[i];
		Cy[i] = Ay[i] + By[i];
		Cz[i] = Az[i] + Bz[i];
		Cw[i] = Aw[i] + Bw[i];
	}
}

Although the profiler reports memory alignment and access pattern issues for the double4 version, the kernel execution times for both versions are almost identical.

Robert_Crovella · September 8, 2015, 4:46pm

Caching effects will tend to mitigate access pattern issues at this level of granularity.

njuffa · September 8, 2015, 4:52pm

That is not too surprising. The version using ‘double’ uses 64-bit loads but should be perfectly coalesced. The version using ‘double4’ uses 128-bit loads but is imperfectly coalesced. Wider loads make for more effective buffering by the queues in the load/store units, which gains efficiency; the coalescing issues then negatively impact load performance, offsetting the gain from the wider loads. That the two effects balance almost perfectly is likely an artifact of this particular code, and wouldn’t necessarily apply to every code context in which such an experiment is performed.

Note that GPUs require data to be properly aligned or loads will return undefined data. The ‘double4’ type is declared to be aligned to 16 bytes, so a simple conversion of an 8-byte aligned ‘double’ pointer to a ‘double4’ pointer may result in incorrectly behaving code unless the 16-byte alignment of the pointer is checked, or guaranteed by construction.
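A minimal sketch of such a pointer check, written as plain C++ so it can be tried on the host. The helper names `is_aligned` and `can_view_as_double4` are hypothetical, and the 16-byte figure is simply the alignment stated above for ‘double4’.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if `p` is aligned to `alignment` bytes (alignment must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// True if a buffer of `count` doubles starting at `p` can be viewed as whole
// double4 elements: the pointer must be 16-byte aligned (assumed double4
// alignment) and the length must be a multiple of 4.
inline bool can_view_as_double4(const double* p, std::size_t count) {
    return count % 4 == 0 && is_aligned(p, 16);
}
```

Pointers returned by cudaMalloc are aligned to at least 256 bytes, so the check matters mainly for pointers derived by offsetting into an allocation: an offset of an odd number of doubles leaves the pointer only 8-byte aligned, and the cast should then be avoided.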

Robert_Crovella · September 8, 2015, 5:02pm

I’m not sure that the implementation given for double4 operator + will break into double2 (128-bit) loads.

EDIT: Yes, it does.

njuffa · September 8, 2015, 6:03pm

It is definitely prudent to check this by inspecting the SASS. The vectorization of memory accesses is an optimization, and there have been instances in the past where the CUDA compiler temporarily “lost the magic sauce” to make it happen in particular contexts.

Robert_Crovella · September 8, 2015, 6:49pm

Yes, here’s the SASS comparison between the two kernels. In addition to verifying 128-bit loads, it sheds light on what other factors might influence the performance (e.g. if there were extra integer pressure arising from some other code…)

$ cuobjdump -sass t900.o

Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_20
                Function : _Z9vectorAddPK7double4S1_PS_i
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];                      /* 0x2800440400005de4 */
        /*0008*/         S2R R0, SR_CTAID.X;                         /* 0x2c00000094001c04 */
        /*0010*/         S2R R2, SR_TID.X;                           /* 0x2c00000084009c04 */
        /*0018*/         IMAD R0, R0, c[0x0][0x8], R2;               /* 0x2004400020001ca3 */
        /*0020*/         ISETP.GE.AND P0, PT, R0, c[0x0][0x38], PT;  /* 0x1b0e4000e001dc23 */
        /*0028*/     @P0 BRA.U 0xb8;                                 /* 0x40000002200081e7 */
        /*0030*/    @!P0 MOV32I R3, 0x20;                            /* 0x180000008000e1e2 */
        /*0038*/    @!P0 IMAD R22.CC, R0, R3, c[0x0][0x20];          /* 0x200780008005a0a3 */
        /*0040*/    @!P0 IMAD.HI.X R23, R0, R3, c[0x0][0x24];        /* 0x208680009005e0e3 */
        /*0048*/    @!P0 IMAD R20.CC, R0, R3, c[0x0][0x28];          /* 0x20078000a00520a3 */
        /*0050*/    @!P0 LD.E.128 R16, [R22];                        /* 0x84000000016420c5 */
        /*0058*/    @!P0 IMAD.HI.X R21, R0, R3, c[0x0][0x2c];        /* 0x20868000b00560e3 */
        /*0060*/    @!P0 LD.E.128 R8, [R22+0x10];                    /* 0x84000000416220c5 */
        /*0068*/    @!P0 IMAD R2.CC, R0, R3, c[0x0][0x30];           /* 0x20078000c000a0a3 */
        /*0070*/    @!P0 LD.E.128 R12, [R20];                        /* 0x84000000014320c5 */
        /*0078*/    @!P0 LD.E.128 R4, [R20+0x10];                    /* 0x84000000414120c5 */
        /*0080*/    @!P0 IMAD.HI.X R3, R0, R3, c[0x0][0x34];         /* 0x20868000d000e0e3 */
        /*0088*/    @!P0 DADD R14, R18, R14;                         /* 0x480000003923a001 */
        /*0090*/    @!P0 DADD R12, R16, R12;                         /* 0x4800000031032001 */
        /*0098*/    @!P0 DADD R6, R10, R6;                           /* 0x4800000018a1a001 */
        /*00a0*/    @!P0 DADD R4, R8, R4;                            /* 0x4800000010812001 */
        /*00a8*/    @!P0 ST.E.128 [R2], R12;                         /* 0x94000000002320c5 */
        /*00b0*/    @!P0 ST.E.128 [R2+0x10], R4;                     /* 0x94000000402120c5 */
        /*00b8*/         EXIT;                                       /* 0x8000000000001de7 */
                ..............................................


                Function : _Z9vectorAddPKdS0_S0_S0_S0_S0_S0_S0_PdS1_S1_S1_i
        .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
        /*0000*/         MOV R1, c[0x1][0x100];                      /* 0x2800440400005de4 */
        /*0008*/         S2R R0, SR_CTAID.X;                         /* 0x2c00000094001c04 */
        /*0010*/         S2R R2, SR_TID.X;                           /* 0x2c00000084009c04 */
        /*0018*/         IMAD R0, R0, c[0x0][0x8], R2;               /* 0x2004400020001ca3 */
        /*0020*/         ISETP.GE.AND P0, PT, R0, c[0x0][0x80], PT;  /* 0x1b0e40020001dc23 */
        /*0028*/     @P0 BRA.U 0x178;                                /* 0x40000005200081e7 */
        /*0030*/    @!P0 MOV32I R12, 0x8;                            /* 0x18000000200321e2 */
        /*0038*/    @!P0 IMAD R4.CC, R0, R12, c[0x0][0x20];          /* 0x20198000800120a3 */
        /*0040*/    @!P0 IMAD.HI.X R5, R0, R12, c[0x0][0x24];        /* 0x20988000900160e3 */
        /*0048*/    @!P0 IMAD R14.CC, R0, R12, c[0x0][0x40];         /* 0x201980010003a0a3 */
        /*0050*/    @!P0 LD.E.64 R10, [R4];                          /* 0x840000000042a0a5 */
        /*0058*/    @!P0 IMAD.HI.X R15, R0, R12, c[0x0][0x44];       /* 0x209880011003e0e3 */
        /*0060*/    @!P0 IMAD R6.CC, R0, R12, c[0x0][0x60];          /* 0x201980018001a0a3 */
        /*0068*/    @!P0 LD.E.64 R8, [R14];                          /* 0x8400000000e220a5 */
        /*0070*/    @!P0 IMAD.HI.X R7, R0, R12, c[0x0][0x64];        /* 0x209880019001e0e3 */
        /*0078*/    @!P0 IMAD R2.CC, R0, R12, c[0x0][0x28];          /* 0x20198000a000a0a3 */
        /*0080*/    @!P0 IMAD.HI.X R3, R0, R12, c[0x0][0x2c];        /* 0x20988000b000e0e3 */
        /*0088*/    @!P0 IMAD R4.CC, R0, R12, c[0x0][0x48];          /* 0x20198001200120a3 */
        /*0090*/    @!P0 IMAD.HI.X R5, R0, R12, c[0x0][0x4c];        /* 0x20988001300160e3 */
        /*0098*/    @!P0 IMAD R16.CC, R0, R12, c[0x0][0x68];         /* 0x20198001a00420a3 */
        /*00a0*/    @!P0 IMAD.HI.X R17, R0, R12, c[0x0][0x6c];       /* 0x20988001b00460e3 */
        /*00a8*/    @!P0 IMAD R14.CC, R0, R12, c[0x0][0x30];         /* 0x20198000c003a0a3 */
        /*00b0*/    @!P0 IMAD.HI.X R15, R0, R12, c[0x0][0x34];       /* 0x20988000d003e0e3 */
        /*00b8*/    @!P0 DADD R18, R10, R8;                          /* 0x4800000020a4a001 */
        /*00c0*/    @!P0 ST.E.64 [R6], R18;                          /* 0x940000000064a0a5 */
        /*00c8*/    @!P0 LD.E.64 R8, [R2];                           /* 0x84000000002220a5 */
        /*00d0*/    @!P0 LD.E.64 R10, [R4];                          /* 0x840000000042a0a5 */
        /*00d8*/    @!P0 IMAD R2.CC, R0, R12, c[0x0][0x50];          /* 0x201980014000a0a3 */
        /*00e0*/    @!P0 IMAD.HI.X R3, R0, R12, c[0x0][0x54];        /* 0x209880015000e0e3 */
        /*00e8*/    @!P0 DADD R18, R8, R10;                          /* 0x480000002884a001 */
        /*00f0*/    @!P0 IMAD R10.CC, R0, R12, c[0x0][0x70];         /* 0x20198001c002a0a3 */
        /*00f8*/    @!P0 ST.E.64 [R16], R18;                         /* 0x940000000104a0a5 */
        /*0100*/    @!P0 LD.E.64 R4, [R14];                          /* 0x8400000000e120a5 */
        /*0108*/    @!P0 LD.E.64 R6, [R2];                           /* 0x840000000021a0a5 */
        /*0110*/    @!P0 IMAD.HI.X R11, R0, R12, c[0x0][0x74];       /* 0x20988001d002e0e3 */
        /*0118*/    @!P0 IMAD R8.CC, R0, R12, c[0x0][0x38];          /* 0x20198000e00220a3 */
        /*0120*/    @!P0 IMAD.HI.X R9, R0, R12, c[0x0][0x3c];        /* 0x20988000f00260e3 */
        /*0128*/    @!P0 IMAD R2.CC, R0, R12, c[0x0][0x58];          /* 0x201980016000a0a3 */
        /*0130*/    @!P0 IMAD.HI.X R3, R0, R12, c[0x0][0x5c];        /* 0x209880017000e0e3 */
        /*0138*/    @!P0 IMAD R16.CC, R0, R12, c[0x0][0x78];         /* 0x20198001e00420a3 */
        /*0140*/    @!P0 IMAD.HI.X R17, R0, R12, c[0x0][0x7c];       /* 0x20988001f00460e3 */
        /*0148*/    @!P0 DADD R14, R4, R6;                           /* 0x480000001843a001 */
        /*0150*/    @!P0 ST.E.64 [R10], R14;                         /* 0x9400000000a3a0a5 */
        /*0158*/    @!P0 LD.E.64 R4, [R8];                           /* 0x84000000008120a5 */
        /*0160*/    @!P0 LD.E.64 R6, [R2];                           /* 0x840000000021a0a5 */
        /*0168*/    @!P0 DADD R2, R4, R6;                            /* 0x480000001840a001 */
        /*0170*/    @!P0 ST.E.64 [R16], R2;                          /* 0x940000000100a0a5 */
        /*0178*/         EXIT;                                       /* 0x8000000000001de7 */
                .................................................................



Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$
