Coalescing - beginner question


I’m a beginner in CUDA and I have a question:

My (simplified) kernel:

__global__ void mykernel(float* out, float* in) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx] + in[idx + 1];
}

cudaprof tells me that the accesses are not coalesced, and by commenting out parts of the code I found it comes from “in[idx + 1]”.

I can’t understand why this is not coalesced, since consecutive threads access consecutive data!

Any idea?

Thanks!

Technically, coalescing also has an alignment requirement. What is the compute capability of your device? That will determine how much you need to worry about this.

I know that misalignment requires an additional load on 1.3. However, in the CUDA profiler’s text mode I get gld_incoherent = [0] for TS’s first example, yet the GPU kernel time does increase by about 25% going from aligned vector addition to misaligned. Why doesn’t the gld_incoherent counter catch this?

I believe the gld_incoherent and gst_incoherent counters are only applicable on Compute 1.0/1.1 hardware. They don’t work on GT200 or Fermi.

So what are the alternatives?

I believe that it is replaced with gld_32b, gld_64b and gld_128b. Misaligning your read should increase one of those counters.
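For reference, with the old command-line profiler these counters are requested through a config file named by the CUDA_PROFILE_CONFIG environment variable, one counter per line. A minimal setup (the file name profile_config.txt is just an example) might look like:

```
# shell environment for the legacy command-line profiler
CUDA_PROFILE=1
CUDA_PROFILE_CONFIG=profile_config.txt

# contents of profile_config.txt, one counter per line:
gld_32b
gld_64b
gld_128b
```

The counts then show up in the generated cuda_profile log alongside the per-kernel timings.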

To be more clear about the original problem: regardless of coalescing, this kernel would be a good candidate for use of shared memory to reduce the repetitive loading. (You read every element twice from global memory, when you should only need to read it once.)

This kernel will run a little more than twice as slow as a shared memory version on capability 1.2 and 1.3, and many, many times slower on compute capability 1.0 and 1.1. I think the cache on compute capability 2.0 means the kernel will run nearly full speed without shared memory.
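To make the shared-memory suggestion concrete, here is a minimal sketch of what such a version might look like. The name mykernel_shared and the TILE constant are mine; it assumes blockDim.x == TILE, and (like the original kernel) it still reads one element past the end of the array in the last block, so real code would need a bounds check.

```cuda
#define TILE 256  // threads per block; must match the launch configuration

__global__ void mykernel_shared(float* out, const float* in) {
    __shared__ float tile[TILE + 1];            // +1 for the overlap element
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[idx];                // one coalesced load per thread
    if (threadIdx.x == 0)                       // one thread fetches the extra element
        tile[TILE] = in[blockIdx.x * blockDim.x + TILE];
    __syncthreads();

    out[idx] = tile[threadIdx.x] + tile[threadIdx.x + 1];
}
```

Each global element is now loaded once instead of twice, and both global loads are aligned and coalesced; the misaligned +1 access happens only in shared memory, where alignment doesn’t matter.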

These work with 1.3, but not with 2.0, I believe:

NV_Warning: Ignoring the invalid profiler config option: gld_32b

NV_Warning: Ignoring the invalid profiler config option: gld_64b

NV_Warning: Ignoring the invalid profiler config option: gld_128b

Thank you!

I used to work with a GTX275 (1.3), but now I only have a 1.1 Quadro.

So I’ve just written a version with shared memory, but my kernel actually uses 5 arrays like that, which requires too much shared memory to keep a reasonable number of threads per block.
I’m trying to reduce that number to 3. Anyway, thank you!

And, isn’t texture memory a good candidate for dealing with this kind of misalignment problem?
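On pre-Fermi hardware, yes: texture fetches go through the texture cache and have no coalescing or alignment requirement, at the cost of slightly higher latency. A sketch using the texture-reference API of that era (the name texIn is mine):

```cuda
// Legacy texture-reference API (the one available on compute 1.x parts)
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void mykernel_tex(float* out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // tex1Dfetch reads go through the texture cache,
    // so the misaligned idx + 1 access carries no coalescing penalty
    out[idx] = tex1Dfetch(texIn, idx) + tex1Dfetch(texIn, idx + 1);
}

// Host side, before the launch:
//   cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));
```

Since each element is still fetched twice, the texture cache absorbs the second read; a shared-memory version avoids it entirely, so which is faster depends on how constrained your shared memory is.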

I forgot a question I wanted to ask: somewhere I read that coalescing is ensured only if, within a half warp (for 1.0–1.3), threadIdx.y and threadIdx.z are constant. Is that true?

A kernel launched with blocks of size (8, 8, 8), for example:

int x = threadIdx.x + blockDim.x * blockIdx.x;
int y = threadIdx.y + blockDim.y * blockIdx.y;
int z = threadIdx.z;
int indx = x + 8 * (y + 8 * z); // 8 = blockDim.x = blockDim.y
float f = array[indx]; // coalesced or not?


Here it’s clear that the 16 threads of a half warp access consecutive addresses, but threadIdx.y is not constant within each half warp (it spans two rows of 8).

What to think ?

Correct, with compute 2.0 I think misalignment is mostly a non-issue due to the L1 cache. (Although a microbenchmark to verify that would be nice.) The only things you can count are the number of global load instructions and L1 cache hits or misses.