I posted this earlier in the Overclocking and Benchmarking forum, but I think this is a more appropriate forum for my question.
I am new to GPGPU programming and am trying to write a benchmark to test the write bandwidth of a GTX 480 (device-to-device bandwidth). Nvidia states a d2d bandwidth of 177 GB/s, but I ran a small micro-kernel that predominantly writes to global memory and I am getting a much higher figure (around 190 GB/s). Is this possible?
I think I may be making a mistake in my calculations, or my code is somehow wrong.
[codebox]__global__ void MicroBM_BandwidthTest(float* A, int max_comp)
{
    int j;
    // max_comp = 1000 in my runs
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // currently not using any shared memory, but I don't think that is required
    for (j = 0; j < max_comp; j++)
        A[i]++;
}[/codebox]
Since all memory accessed by the above kernel is global, I am assuming there will be 1 global load and 1000 global stores for each thread, i.e. for each data element.
The formula I use to calculate the bandwidth is:
float bwinMBps = (num_ld_st /* Number of load stores in the kernel = 1001 */* ARRAY_SIZE /* Total data transfer */) * 1e3f /* Because time is in milli seconds */ / (elapsedTimeInMs * (float)(1 << 20) /* Divide 1 M */);
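To make the arithmetic explicit, here is a minimal sketch of the same formula as a host-side helper. The function and parameter names are made up for illustration, and it assumes ARRAY_SIZE is the buffer size in bytes; if it is an element count, it would need multiplying by sizeof(float) first.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the formula above.
// Assumption: array_size_bytes is the buffer size in BYTES, not elements.
// num_ld_st is the assumed number of global accesses per element (1001 here).
float bandwidthMBps(int num_ld_st, std::size_t array_size_bytes, float elapsed_ms)
{
    // Total bytes assumed to be moved by the kernel
    double total_bytes = (double)num_ld_st * (double)array_size_bytes;
    // Timer value is in milliseconds, convert to seconds
    double seconds = elapsed_ms / 1e3;
    // Divide by 2^20 to express the result in megabytes per second
    return (float)(total_bytes / seconds / (double)(1 << 20));
}
```

For example, bandwidthMBps(1, 1 << 20, 1000.0f) moves 2^20 bytes in one second and returns 1.0. Because the divisor is 1 << 20, the result is in binary megabytes (2^20 bytes) rather than decimal ones (10^6 bytes).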
By this calculation I get a write bandwidth (d2d) of 190 GB/s.
Where am I going wrong?
All help is appreciated.