GMem coalescing bandwidth for double data

Hi Guys,

I am having a hard time understanding the GMem coalescing! I thought I understand those scenarios given in the “Best Practices Guide”, but when I get down to an example of using double data type, my calculated effective bandwidth using different offsets does not really match with the results given by the actual tests unfortunately.

I am running all the tests on a Tesla C1060, the data type is set to be double, I first use the codes below to find out the effective bandwidth when offset = 0, it turns out to 82.4 GB/s. Then I compute the effective bandwidth when offset=16B. As illustrated in the following figure, in this case, two fetching are required (a 128B and a 32B ) for both of the half-warp, which ends up with an efficiency of :

(128)/(128+32) = 0.8

the actual bandwidth that I got from the test is: 64.1 GB/s, which takes up about (64.1)/(82.4) = 0.78, which looks like close to the value just computed.

However, if I change the offset=4, based on the way that I am calculating, there will still be 2 fetching again (a 128B and a 32B) for both the half-warp, as illustrated in the following figure, it gonna be the same efficiency of 0.8 then. But by looking at the testing result, it is actually,

(70.3)/(82.4) = 0.85

which is obviously different from the what I have just computed!

Man, this problem really drives me crazy! Could someone give me some points on this? I am attaching the testing results of different offsets(0B, 16B and 32B) as well.

Thanks a lot for the help!

[codebox]#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include “cublas.h”

#include “cblas.h”

// includes, project

#include <cutil_inline.h>

#define threadsNum 128

#define blockNum 32768

#define loopNumber 10000

#define offsetStep 8


// declaration, forward

global void offsetCopy(double*, double*, int offset);

int main(void)


int i=0;

int VLen =threadsNum * blockNum + offsetStep; 

int mem_size_data = VLen * sizeof(double);

cublasStatus status;

 // Device Selection

status = cudaSetDevice(1);

if (status != CUBLAS_STATUS_SUCCESS) {

    printf ("!!!! Set Device error\n");

    return EXIT_FAILURE;



// Allocate the device memory

double* d_idata;

status = cudaMalloc((void**) &d_idata, mem_size_data);

if (status != CUBLAS_STATUS_SUCCESS) {

    printf ("!!!! Allocate memory (d_idata) error\n");

    return EXIT_FAILURE;


double* d_odata;

status = cudaMalloc((void**) &d_odata, mem_size_data);

if (status != CUBLAS_STATUS_SUCCESS) {

    printf ("!!!! Allocate memory (d_odata) error\n");

    return EXIT_FAILURE;


dim3 threads(threadsNum, 1);

dim3 grid(blockNum, 1);

// create and start timer

float timeCost;

cudaEvent_t start, end;



cudaEventRecord(start, 0);

for (i=0; i<= loopNumber; i++)


	offsetCopy<<<grid, threads>>>(d_odata, d_idata, offsetStep);



cudaEventRecord(end, 0);


cudaEventElapsedTime(&timeCost, start, end);

timeCost = timeCost/loopNumber;


printf("\nOffset = %d : time cost: %f (ms), Effective Bandwidth %f GB/s  \n",offsetStep, timeCost,  


printf("\n\nTest Mem Finished....\n\n");


global void offsetCopy(double *odata, double *idata, int offset)


int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;

odata[xid] = idata[xid];



Offset = 0 : time cost: 0.814435 (ms), Effective Bandwidth 82.399251 GB/s

Offset = 2 : time cost: 1.047461 (ms), Effective Bandwidth 64.068155 GB/s

Offset = 4 : time cost: 0.954634 (ms), Effective Bandwidth 70.298069 GB/s [/codebox]