Coalescing - beginner question


I’m a beginner in CUDA and I have a question:

My (simplified) kernel:

__global__ void mykernel(float* out, float* in) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx] + in[idx + 1];
}

cudaprof tells me that the accesses are not coalesced, and by commenting out parts of the code I found it comes from “in[idx + 1]”.

I can’t understand why this is not coalesced, since consecutive threads access consecutive data!

Any idea?

Thanks!

Technically, coalescing also has an alignment requirement. What is the compute capability of your device? That will determine how much you need to worry about this.

I know that misalignment requires an additional load on 1.3. However, in the CUDA profiler’s text mode I get gld_incoherent = [0] for TS’s first example, yet the GPU kernel time does increase by about 25% going from aligned vector addition to misaligned. Why doesn’t the gld_incoherent counter catch this?

I believe the gld_incoherent and gst_incoherent counters are only applicable on Compute 1.0/1.1 hardware. They don’t work on GT200 or Fermi.

So what are the alternatives?

I believe that it is replaced with gld_32b, gld_64b and gld_128b. Misaligning your read should increase one of those counters.
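For reference, with the old command-line profiler these counters are requested through a config file named by the CUDA_PROFILE_CONFIG environment variable, one counter per line. A minimal setup (the file name profile_config.txt is just an example) might look like:

```
# shell environment for the legacy command-line profiler
CUDA_PROFILE=1
CUDA_PROFILE_CONFIG=profile_config.txt

# contents of profile_config.txt, one counter per line:
gld_32b
gld_64b
gld_128b
```

The counts then show up in the generated cuda_profile log alongside the per-kernel timings.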

To be more clear about the original problem: regardless of coalescing, this kernel would be a good candidate for use of shared memory to reduce the repetitive loading. (You read every element twice from global memory, when you should only need to read it once.)

This kernel will run a little more than twice as slow as a shared memory version on capability 1.2 and 1.3, and many, many times slower on compute capability 1.0 and 1.1. I think the cache on compute capability 2.0 means the kernel will run nearly full speed without shared memory.
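To make the shared-memory suggestion concrete, here is a minimal sketch of what such a version might look like. The name mykernel_shared and the TILE constant are mine; it assumes blockDim.x == TILE, and (like the original kernel) it still reads one element past the end of the array in the last block, so real code would need a bounds check.

```cuda
#define TILE 256  // threads per block; must match the launch configuration

__global__ void mykernel_shared(float* out, const float* in) {
    __shared__ float tile[TILE + 1];            // +1 for the overlap element
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[idx];                // one coalesced load per thread
    if (threadIdx.x == 0)                       // one thread fetches the extra element
        tile[TILE] = in[blockIdx.x * blockDim.x + TILE];
    __syncthreads();

    out[idx] = tile[threadIdx.x] + tile[threadIdx.x + 1];
}
```

Each global element is now loaded once instead of twice, and both global loads are aligned and coalesced; the misaligned +1 access happens only in shared memory, where alignment doesn’t matter.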

These work with 1.3, but not with 2.0, I believe:

NV_Warning: Ignoring the invalid profiler config option: gld_32b

NV_Warning: Ignoring the invalid profiler config option: gld_64b

NV_Warning: Ignoring the invalid profiler config option: gld_128b

Thank you!

I used to work with a GTX275 (1.3), but now I only have a 1.1 Quadro.

So I’ve just written a version with shared memory, but my kernel actually uses 5 arrays like that, which requires too much shared memory to keep a reasonable number of threads per block.
I’m trying to reduce that number to 3. Anyway, thank you!

And, isn’t texture memory a good candidate for dealing with this kind of misalignment problem?
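On pre-Fermi hardware, yes: texture fetches go through the texture cache and have no coalescing or alignment requirement, at the cost of slightly higher latency. A sketch using the texture-reference API of that era (the name texIn is mine):

```cuda
// Legacy texture-reference API (the one available on compute 1.x parts)
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void mykernel_tex(float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // tex1Dfetch reads go through the texture cache,
    // so the misaligned idx + 1 access carries no coalescing penalty
    out[idx] = tex1Dfetch(texIn, idx) + tex1Dfetch(texIn, idx + 1);
}

// Host side, before the launch:
//   cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));
```

Since each element is still fetched twice, the texture cache absorbs the second read; a shared-memory version avoids it entirely, so which is faster depends on how constrained your shared memory is.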

I forgot a question I wanted to ask: somewhere I read that coalescing is ensured only if, within a half warp (for 1.0–1.3), threadIdx.y and threadIdx.z are constant. Is that true?

A kernel launched with blocks of size (8, 8, 8), for example:

int x = threadIdx.x + blockDim.x * blockIdx.x;
int y = threadIdx.y + blockDim.y * blockIdx.y;
int z = threadIdx.z;
int indx = x + 8 * (y + 8 * z); // 8 = blockDim.x = blockDim.y
float f = array[indx]; // coalesced or not?


Here it’s clear that the 16 threads of a half warp access consecutive addresses, but threadIdx.y is not constant within each half warp (it spans two rows of 8).

What to think ?

Correct, with compute 2.0 I think misalignment is mostly a non-issue due to the L1 cache. (Although a microbenchmark to verify that would be nice.) The only things you can count are the number of global load instructions and L1 cache hits or misses.