Coalescing - beginner question


I’m a beginner in CUDA and I have a question:

My (simplified) kernel:

__global__ void mykernel(float* out, float* in) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx] + in[idx + 1];
}

cudaprof tells me that the accesses are not coalesced, and by commenting out parts of the code I found it comes from “in[idx + 1]”.

I can’t understand why this is not coalesced, since consecutive threads access consecutive data!

Any idea?

Thanks!

Technically, coalescing also has an alignment requirement. What is the compute capability of your device? That will determine how much you need to worry about this.

I know that misalignment requires an additional load on 1.3. However, in the CUDA profiler’s text mode I get gld_incoherent = [0] for TS’s first example, yet the GPU kernel time does increase by about 25% going from aligned vector addition to misaligned. Why doesn’t the gld_incoherent counter catch this?

I believe the gld_incoherent and gst_incoherent counters are only applicable on Compute 1.0/1.1 hardware. They don’t work on GT200 or Fermi.

So what are the alternatives?

I believe that it is replaced with gld_32b, gld_64b and gld_128b. Misaligning your read should increase one of those counters.
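For reference, with the old command-line profiler these counters are requested through a config file named by the CUDA_PROFILE_CONFIG environment variable, one counter per line. A minimal setup (the file name profile_config.txt is just an example) might look like:

```
# shell environment for the legacy command-line profiler
CUDA_PROFILE=1
CUDA_PROFILE_CONFIG=profile_config.txt

# contents of profile_config.txt, one counter per line:
gld_32b
gld_64b
gld_128b
```

The counts then show up in the generated cuda_profile log alongside the per-kernel timings.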

To be more clear about the original problem: regardless of coalescing, this kernel would be a good candidate for use of shared memory to reduce the repetitive loading. (You read every element twice from global memory, when you should only need to read it once.)

This kernel will run a little more than twice as slow as a shared memory version on capability 1.2 and 1.3, and many, many times slower on compute capability 1.0 and 1.1. I think the cache on compute capability 2.0 means the kernel will run nearly full speed without shared memory.
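To make the shared-memory suggestion concrete, here is a minimal sketch of what such a version might look like. The name mykernel_shared and the TILE constant are mine; it assumes blockDim.x == TILE, and (like the original kernel) it still reads one element past the end of the array in the last block, so real code would need a bounds check.

```cuda
#define TILE 256  // threads per block; must match the launch configuration

__global__ void mykernel_shared(float* out, const float* in) {
    __shared__ float tile[TILE + 1];            // +1 for the overlap element
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[idx];                // one coalesced load per thread
    if (threadIdx.x == 0)                       // one thread fetches the extra element
        tile[TILE] = in[blockIdx.x * blockDim.x + TILE];
    __syncthreads();

    out[idx] = tile[threadIdx.x] + tile[threadIdx.x + 1];
}
```

Each global element is now loaded once instead of twice, and both global loads are aligned and coalesced; the misaligned +1 access happens only in shared memory, where alignment doesn’t matter.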

These work with 1.3, but not with 2.0, I believe:

NV_Warning: Ignoring the invalid profiler config option: gld_32b

NV_Warning: Ignoring the invalid profiler config option: gld_64b

NV_Warning: Ignoring the invalid profiler config option: gld_128b

Thank you!

I used to work with a GTX275 (1.3), but now I only have a 1.1 Quadro.

So I’ve just written a version with shared memory, but my kernel actually uses 5 arrays like that, which requires too much shared memory to keep a reasonable number of threads per block.
I’m trying to reduce that number to 3. Anyway, thank you!

And, isn’t texture memory a good candidate for dealing with this kind of misalignment problem?
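On pre-Fermi hardware, yes: texture fetches go through the texture cache and have no coalescing or alignment requirement, at the cost of slightly higher latency. A sketch using the texture-reference API of that era (the name texIn is mine):

```cuda
// Legacy texture-reference API (the one available on compute 1.x parts)
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void mykernel_tex(float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // tex1Dfetch reads go through the texture cache,
    // so the misaligned idx + 1 access carries no coalescing penalty
    out[idx] = tex1Dfetch(texIn, idx) + tex1Dfetch(texIn, idx + 1);
}

// Host side, before the launch:
//   cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));
```

Since each element is still fetched twice, the texture cache absorbs the second read; a shared-memory version avoids it entirely, so which is faster depends on how constrained your shared memory is.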

I forgot a question I wanted to ask: somewhere I read that coalescing is ensured only if, within a half warp (for 1.0–1.3), threadIdx.y and threadIdx.z are constant. Is that true?

A kernel launched with blocks of size (8, 8, 8), for example:

int x = threadIdx.x + blockDim.x * blockIdx.x;
int y = threadIdx.y + blockDim.y * blockIdx.y;
int z = threadIdx.z;
int indx = x + 8 * (y + 8 * z); // 8 = blockDim.x = blockDim.y
float f = array[indx]; // coalesced or not?


Here it’s clear that the 16 threads of a half warp access consecutive addresses, but threadIdx.y is not constant within each half warp (it spans two rows of 8).

What to think ?

Correct, with compute 2.0 I think misalignment is mostly a non-issue due to the L1 cache. (Although a microbenchmark to verify that would be nice.) The only things you can count are the number of global load instructions and L1 cache hits or misses.