Hi Guys,
I am having a hard time understanding the GMem coalescing! I thought I understand those scenarios given in the “Best Practices Guide”, but when I get down to an example of using double data type, my calculated effective bandwidth using different offsets does not really match with the results given by the actual tests unfortunately.
I am running all the tests on a Tesla C1060, the data type is set to be double, I first use the codes below to find out the effective bandwidth when offset = 0, it turns out to 82.4 GB/s. Then I compute the effective bandwidth when offset=16B. As illustrated in the following figure, in this case, two fetching are required (a 128B and a 32B ) for both of the half-warp, which ends up with an efficiency of :
(128)/(128+32) = 0.8
the actual bandwidth that I got from the test is: 64.1 GB/s, which takes up about (64.1)/(82.4) = 0.78, which looks like close to the value just computed.
However, if I change the offset=4, based on the way that I am calculating, there will still be 2 fetching again (a 128B and a 32B) for both the half-warp, as illustrated in the following figure, it gonna be the same efficiency of 0.8 then. But by looking at the testing result, it is actually,
(70.3)/(82.4) = 0.85
which is obviously different from the what I have just computed!
Man, this problem really drives me crazy! Could someone give me some points on this? I am attaching the testing results of different offsets(0B, 16B and 32B) as well.
Thanks a lot for the help!
[codebox]#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include “cublas.h”
#include “cblas.h”
// includes, project
#include <cutil_inline.h>
#define threadsNum 128
#define blockNum 32768
#define loopNumber 10000
#define offsetStep 8
////////////////////////////////////////////////////////////////////////////////
// declaration, forward
global void offsetCopy(double*, double*, int offset);
int main(void)
{
int i=0;
int VLen =threadsNum * blockNum + offsetStep;
int mem_size_data = VLen * sizeof(double);
cublasStatus status;
// Device Selection
status = cudaSetDevice(1);
if (status != CUBLAS_STATUS_SUCCESS) {
printf ("!!!! Set Device error\n");
return EXIT_FAILURE;
}
// Allocate the device memory
double* d_idata;
status = cudaMalloc((void**) &d_idata, mem_size_data);
if (status != CUBLAS_STATUS_SUCCESS) {
printf ("!!!! Allocate memory (d_idata) error\n");
return EXIT_FAILURE;
}
double* d_odata;
status = cudaMalloc((void**) &d_odata, mem_size_data);
if (status != CUBLAS_STATUS_SUCCESS) {
printf ("!!!! Allocate memory (d_odata) error\n");
return EXIT_FAILURE;
}
dim3 threads(threadsNum, 1);
dim3 grid(blockNum, 1);
// create and start timer
float timeCost;
cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);
cudaEventRecord(start, 0);
for (i=0; i<= loopNumber; i++)
{
offsetCopy<<<grid, threads>>>(d_odata, d_idata, offsetStep);
}
cudaEventRecord(end, 0);
cudaEventSynchronize(end);
cudaEventElapsedTime(&timeCost, start, end);
timeCost = timeCost/loopNumber;
printf("\nOffset = %d : time cost: %f (ms), Effective Bandwidth %f GB/s \n",offsetStep, timeCost,
2*mem_size_data/(timeCost*1.0e6));
printf("\n\nTest Mem Finished....\n\n");
}
global void offsetCopy(double *odata, double *idata, int offset)
{
int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
odata[xid] = idata[xid];
}[/codebox]
[codebox]
Offset = 0 : time cost: 0.814435 (ms), Effective Bandwidth 82.399251 GB/s
Offset = 2 : time cost: 1.047461 (ms), Effective Bandwidth 64.068155 GB/s
Offset = 4 : time cost: 0.954634 (ms), Effective Bandwidth 70.298069 GB/s [/codebox]