We know that the peak device-to-device bandwidth of the Quadro FX 5600 is about 80 GB/s.
As I understand it, a device-to-device transfer consists of two operations:
- read a value from the global source memory into a register (or shared memory)
- write the value from the register to the global destination memory
So a simple operation, such as adding a constant to each source element and writing the result to global memory, should yield the same bandwidth. However, when I try a simple kernel:
template<class T>
__global__ void cuvppMul_C1( T* g_idata, T* g_odata, const T s)
{
    const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    g_odata[idx] = g_idata[idx] * s;
}
and measure the bandwidth with len = 1 << 22, nIters = 10000, and num_threads = 128:
dim3 threads(num_threads, 1, 1);
dim3 grid(len/num_threads, 1, 1);
cuvppMul_C1<T><<< grid, threads >>>(d_idata, d_odata, 4.5f);
cudaThreadSynchronize();
CUT_SAFE_CALL( cutResetTimer( timer));
// execute the kernel
for (int i=0; i < nIters; ++i){
    CUT_SAFE_CALL( cutStartTimer( timer));
    cuvppMul_C1<T><<< grid, threads >>>(d_idata, d_odata, 4.5f);
    cudaThreadSynchronize();
    CUT_SAFE_CALL( cutStopTimer( timer));
}
runTime = cutGetAverageTimerValue(timer);
printf("Average time: %f ms\n", runTime);
printf("Bandwidth: %f GB/s\n\n", (len * sizeof(T)) / (runTime * 1.0e6));
I can only get 30 GB/s. Why can't I reach the peak rate, and how can I get there?
Any ideas are appreciated.