Striped memory access

I’m rather new to CUDA development and I’ve stumbled onto some odd performance issues when using striped vs coalesced memory access. The kernel I’m playing with is as follows:
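For reference, a sketch of the kernel, reconstructed from the annotated copy quoted later in the thread (the exact line numbering of the original gist may differ):

```cuda
__global__ void testKernel(float* d_array, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 1000; ++i) {
        if (idx % 2 == 0) {
            d_array[idx] = idx + i + 1;   // even threads write
        }
        else {
            d_array[idx] = idx + i + 2;   // odd threads write; presumably the
                                          // "line 18" that gets commented out
        }
    }
}
```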

I’m calling this kernel with 8192 blocks of 128 threads (2^20 or ~1mil threads total).

As is, the kernel runs in 0.0256 ms, but if I comment out line 18 it takes 9.9717 ms (roughly 390 times longer).

By removing line 18 we’re going from coalesced to striped access, but I was under the impression that the memory controller would fetch the whole cache line it’s touching regardless. Since we’re striping over the same range, shouldn’t we expect the same number of memory transactions and similar performance?

GTX 980 Ti
CUDA 7.0.28
NVIDIA x86_64-352.21

I left a comment on your gist that shows what instructions were generated for an sm_50 target.

Commenting out line 18 appears to let the compiler unroll your loop, and the unrolled version requires quite a few more registers.

I’m not sure if it’s a compiler bug but the performance difference smells like one.

You might want to file a bug report if you find that it is.

FWIW, I’ve found that if/else statements can sometimes be surprising in CUDA because of warp divergence.

A more succinct idiom might be helpful:

d_array[idx] = idx + i + ((idx % 2 == 0) ? 1 : 2);


Or even:

d_array[idx] = idx + i + 1 + (idx & 1);

Each iteration performs a global memory access! And why is there a loop at all? The variable just gets overwritten on every iteration.
If the original intent was to accumulate values in the variable, use a local variable and write it out once at the end. Even then you would not need a loop, as the result is directly computable.

// Kernel

__global__ void testKernel(float* d_array, int size){

// Index

  int idx = blockIdx.x * blockDim.x + threadIdx.x;

// Initialize

  d_array[idx] = idx + 1000 + (idx & 0x01);

// Loop - WHY ???

  for(int i = 0; i < 1000; ++i){

// Condition

    if(idx % 2 == 0){
      d_array[idx] = idx + i + 1;
    }
    else{
      d_array[idx] = idx + i + 2;
    }
  }
}