Here’s a simple kernel:
#include <cstdio>   // for device-side printf

template <typename T>
__global__ void reading0(T* __restrict__ target, const T* __restrict__ src, int len, int n)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    T x = 0;   // initialized so that inactive lanes compare a defined value below
    for (int i = tid; i < len; i += stride)
    {
        // only the first n lanes of each warp issue a global-memory read
        if (i % 32 < n)
            x = __ldg(src + i);
        // src holds only zeros, so x is always == 0 and the next GMEM store
        // should never take place; this is to fool the compiler so that it
        // does not optimize the whole kernel out
        if (x < 0)
        {
            target[i] = x;
            printf("ERROR\n");
        }
    }
}
The idea is that n controls how many threads in a warp actually request data from the global memory.
This code was run for T=float and T=double on a K20M (ECC off), with the same number of array elements in both cases.
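The runs were timed along these lines (a minimal sketch; the array length and launch configuration below are illustrative placeholders, not necessarily the values actually used):

    template <typename T>
    float time_reading0(int n, int len = 1 << 24)
    {
        T *src, *target;
        cudaMalloc(&src, len * sizeof(T));
        cudaMalloc(&target, len * sizeof(T));
        cudaMemset(src, 0, len * sizeof(T));   // src must be all zeros for the trick above

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        reading0<T><<<1024, 256>>>(target, src, len, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(src);
        cudaFree(target);
        return ms;
    }

Calling time_reading0<float>(n) and time_reading0<double>(n) for n = 1..32 then gives one point per curve.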
One might expect the execution time for double to be twice that for float. It is not.
The results are here:
As expected, the overall execution time for float is almost independent of n, although the 32-byte granularity of L2 is visible. For double the time is also almost constant and equal to that for float as long as n <= 16; for larger n one might expect the time to increase by a factor of two (in theory the memory is read in 128-byte segments, and for n > 16 an extra segment is required). However, the time increases VERY SLOWLY: for n < 24 it grows only by about 10% rather than 100%. Again one can see the 32-byte granularity of L2. Do we also see the 384-bit (= 48-byte, 12-float, 6-double) memory bus?
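For reference, the granularity arithmetic behind these remarks (a back-of-the-envelope sketch, assuming the first n lanes of a warp read consecutive, segment-aligned elements; sectors32 is just an illustrative helper):

    // 32-byte L2 sectors touched per warp:
    //   float  (4 B): ceil(4*n/32) -- a new sector every 8 lanes
    //   double (8 B): ceil(8*n/32) -- a new sector every 4 lanes
    // One 128-byte segment holds 32 floats but only 16 doubles,
    // hence the expected doubling for doubles with n > 16.
    __host__ __device__ int sectors32(int n, int elemSize)
    {
        return (n * elemSize + 31) / 32;
    }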
Now look at this:
It was computed using a similar kernel that reads TWO streams of data, float/double + int.
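For clarity, a sketch of that two-stream variant (same compiler-fooling trick; the name reading2 and other details are placeholders and may differ from the exact kernel used):

    template <typename T>
    __global__ void reading2(T* __restrict__ target, const T* __restrict__ src,
                             const int* __restrict__ src2, int len, int n)
    {
        int stride = gridDim.x * blockDim.x;
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        T x = 0;
        int y = 0;
        for (int i = tid; i < len; i += stride)
        {
            // each active lane reads one T and one int from independent arrays
            if (i % 32 < n)
            {
                x = __ldg(src + i);
                y = __ldg(src2 + i);
            }
            // both arrays hold only zeros, so this store never executes
            if (x + y < 0)
                target[i] = x;
        }
    }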
Note a very similar pattern to that in the previous figure.
Note also that reading an array of double is MUCH faster than reading two independent arrays of 4-byte data (float + int).
Now have a look at this:
The green staircase is what one would expect after reading the NVIDIA CUDA documentation (GMEM is accessed in 128-byte segments); the blue straight line shows that an array of doubles is effectively accessed in 256-byte segments.
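In numbers (my reading of the figure; a sketch assuming aligned, consecutive accesses):

    // Green staircase: 128-byte segments a warp should need when its
    // first n lanes read consecutive doubles; a step up at n = 17.
    int segs128(int n) { return (8 * n + 127) / 128; }   // 1 for n <= 16, 2 for n > 16
    // Blue line: the measurements behave instead as if a single 256-byte
    // segment always covered the warp's accesses, for any n <= 32.
    int segs256(int n) { return (8 * n + 255) / 256; }   // 1 for all 1 <= n <= 32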
Can someone from NVIDIA comment on this? How does accessing doubles differ from accessing 4-byte streams of data? Do you use any “magic” to speed up 8-byte transfers? Do you use half-warps? Why does the second half-warp behave so differently from the first one? The most important question for me: will it be OK to assume that the memory segments are effectively 256 bytes long for arrays of 8-byte data?
Z Koza