I am using a GeForce 8800 GTX, for which I believe the theoretical device-to-device bandwidth is 86.4 GB/s. Now I want to know how to precisely measure the time to use in the effective-bandwidth calculation. I assume this relates to device-to-device transfers, so to get the time for my transfers, should I just take the difference between calling the kernel with all my instructions and calling an empty kernel (one with no instructions)?
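For reference, here is a minimal sketch of the cudaEvent-based timing I have in mind (the kernel name, arguments, and launch configuration are just placeholders):

cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
downsample<<<grid, block>>>(r_d, g_d, b_d, f_r, f_g, f_b, width, height);  // placeholder launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                        // wait for the kernel to finish
cudaEventElapsedTime(&elapsed_ms, start, stop);    // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);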
Also, for calculating the total number of bytes read and written by a kernel like the following (which downsamples an x-by-y image into an x/2-by-y/2 image):
// kernel signature added for completeness; my actual parameter list may differ
__global__ void downsample(int *r_d, int *g_d, int *b_d, int *f_r, int *f_g, int *f_b, int width, int height)
{
    // flattened global thread index (2D grid of 2D blocks)
    int id = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x
           + blockIdx.y * gridDim.x * blockDim.x * blockDim.y;
    // index of the top-left source pixel of the 2x2 block this thread averages
    int number = 2 * (id % (width / 2)) + (id / (width / 2)) * width * 2;
    if (id < height * width / 4) {
        f_r[id] = (r_d[number] + r_d[number + 1] + r_d[number + width] + r_d[number + width + 1]) / 4;
        f_g[id] = (g_d[number] + g_d[number + 1] + g_d[number + width] + g_d[number + width + 1]) / 4;
        f_b[id] = (b_d[number] + b_d[number + 1] + b_d[number + width] + b_d[number + width + 1]) / 4;
    }
}
Using this kernel, should the bytes read be x*y*3 (for r, g, b) * 4 (bytes per int), and the bytes written be (1/4)*x*y*3 (for r, g, b) * 4 (bytes per int)?
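Assuming those counts are right, I would then plug them into the effective-bandwidth calculation roughly like this (hypothetical 1024 x 1024 image, with elapsed_ms taken from the timing sketch above):

bytes read    = 1024 * 1024 * 3 * 4        = 12,582,912
bytes written = (1024 * 1024 / 4) * 3 * 4  =  3,145,728
effective bandwidth = (12,582,912 + 3,145,728) / (elapsed_ms / 1000) / 1e9  GB/s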