You cannot achieve the full theoretical bandwidth of GDDR5 GPU memory in a benchmark, just as you cannot achieve the full theoretical bandwidth of DDR4 system memory. Expect to max out at around 80% of the theoretical bandwidth. The rules for maximum bandwidth are basically: (1) all accesses are coalesced; (2) each thread makes 128-bit accesses (the best use of the limited-depth load/store queue). The simple kernel below does that (configure to taste, e.g. blocks = 65520, threads/block = 128, len = 100000000).
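__global__ void zcopy (const double2 * __restrict__ src, double2 * __restrict__ dst, int len)
{
    // total number of threads in the grid; each thread steps through the data by this amount
    int stride = gridDim.x * blockDim.x;
    // global index of this thread; adjacent threads touch adjacent elements (coalesced)
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];  // one 128-bit load and one 128-bit store per iteration
    }
}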
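To check the roughly 80% figure on your own card, something like the following host-side harness could be used. It is only a sketch under stated assumptions: the helper copy_bandwidth_gbs, the CUDA_CHECK macro, the repetition count, and the reduced len of 10000000 (two 160 MB buffers instead of two 1.6 GB buffers) are my choices for illustration, not part of the kernel above. The theoretical peak is estimated as 2 x memory clock x bus width in bytes from the device attributes, which assumes the reported clock and a double-data-rate factor of 2 match the vendor's quoted figure; treat that number as approximate.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message on any CUDA runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Copy kernel from the post: coalesced 128-bit loads and stores, grid-stride loop.
__global__ void zcopy (const double2 * __restrict__ src,
                       double2 * __restrict__ dst, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = src[i];
    }
}

// Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes).
// Every element is read once and written once, hence the factor 2.
double copy_bandwidth_gbs (size_t len, size_t elem_bytes, double seconds)
{
    return 2.0 * (double)len * (double)elem_bytes / seconds / 1e9;
}

int main (void)
{
    const int len = 10000000;   // assumption: reduced from 100000000 to fit small cards
    const int blocks = 65520;
    const int threads = 128;
    const int reps = 10;        // assumption: report the fastest of 10 timed runs
    const size_t bytes = (size_t)len * sizeof(double2);

    double2 *src = 0, *dst = 0;
    CUDA_CHECK(cudaMalloc((void **)&src, bytes));
    CUDA_CHECK(cudaMalloc((void **)&dst, bytes));
    CUDA_CHECK(cudaMemset(src, 0, bytes));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Warm-up launch so one-time startup overhead is not included in the timing.
    zcopy<<<blocks, threads>>>(src, dst, len);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    float best_ms = 1e30f;
    for (int r = 0; r < reps; r++) {
        CUDA_CHECK(cudaEventRecord(start));
        zcopy<<<blocks, threads>>>(src, dst, len);
        CUDA_CHECK(cudaEventRecord(stop));
        CUDA_CHECK(cudaEventSynchronize(stop));
        float ms = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
        if (ms < best_ms) best_ms = ms;
    }
    CUDA_CHECK(cudaGetLastError());

    double measured = copy_bandwidth_gbs(len, sizeof(double2), best_ms * 1e-3);

    // Rough theoretical peak for device 0: clock (kHz) x 2 (DDR) x bus width (bytes).
    int clock_khz = 0, bus_bits = 0;
    CUDA_CHECK(cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, 0));
    CUDA_CHECK(cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, 0));
    double peak = 2.0 * clock_khz * 1e3 * (bus_bits / 8.0) / 1e9;

    printf("measured: %.1f GB/s, estimated peak: %.1f GB/s (%.0f%%)\n",
           measured, peak, 100.0 * measured / peak);

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(src));
    CUDA_CHECK(cudaFree(dst));
    return 0;
}
```

Build with nvcc (for example nvcc -O3 -o zcopy zcopy.cu) and run it on an otherwise idle GPU. The buffers are deliberately much larger than the L2 cache so that the timing reflects DRAM traffic rather than cache hits.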
Note that various performance issues have been reported with GDDR5X memory (search this forum for details).