3 x 16 thread block runs faster than 16 x 3 why is that?

Hi all,

I have a kernel that uses a fixed number of blocks in a grid and a fixed number of threads in a block. The number of threads is 48, so I tried running the kernel with a 3 x 16 block and also with a 16 x 3 block. The strange thing is that the profiler shows 25% occupancy in both cases, yet the 16 x 3 version runs about 50% slower.

It is especially strange because Mark said in another post

Which is exactly the opposite of what I’m seeing, so I’m kind of confused about what’s going on. Any ideas?

(Both cubin files show reg = 18 and smem = 4256.)


It’s possible that the 3x16 arrangement is better in your particular case because of a different memory access pattern. Maybe it avoids bank conflicts where 16x3 has many? There may also be related issues with coalescing when accessing global memory.


Thanks Paulius, the bank conflict issue is the most probable explanation, I’ll look into that.

You mention memory coalescing — what exactly does this mean and how does it affect performance? Sorry if I’m asking trivial things here, but the Wikipedia entry “Coalescence (computer science)” is empty :(

This refers to the ability of the hardware to read/write more than one data word at once. That is, the G80 can transfer 128 bits in a single call, so one call can service an entire struct or, for example, 4 threads with 32 bits each. See the relevant section in the manual.


Thanks Peter! It seems that, as long as I only need to read memory from the kernel and not write it, it’s much better to go with texture fetches from array memory; then there’s no need to worry about this coalescence and alignment business. Anyway, I’m starting to understand why my CUDA implementations are still 4-5 times slower than the OpenGL/Cg versions.


Along the same lines (3x16 or 16x3): I have a situation where I need to access a 3D array in all three directions (x varies fastest, z slowest). The bulk of the time is spent accessing data in the z direction. With my block size of 128, can I use, say, a 4x32 or a 32x4 block to get 4 z values that are next to each other? Which should I use? I would think 4x32 would be better in this case.

My approach is just to try all possible combinations and pick the fastest. Sometimes the result is kind of unexpected, at least to me :) For example, the following table shows execution times (in seconds, running the kernel 200 times) of the same kernel with different numbers of threads per block:

32 x 3 time 21.4
24 x 3 time 14.7
16 x 3 time 11.6
8 x 3 time 7.6
4 x 3 time 4.4
2 x 3 time 5.56
1 x 3 time 8.78

3 x 32 time 7.63
3 x 24 time 7.02
3 x 16 time 5.98
3 x 8 time 5.08
3 x 4 time 4.34
3 x 2 time 5.54
3 x 1 time 8.79

Yes, this is also my experience: you cannot get around extensive testing.

BTW, using textures needs more registers. Look at the .ptx file: every texture fetch has to set up quite a bit of state before it calls the texture unit. The extra register usage can mean lower occupancy than the lean device memory fetches allow. So in cases where coalescing works properly, I have seen reading from device memory perform as fast as texture fetches.


Thanks for pointing that out. I already have a problem with too many registers per block; this is the main factor limiting occupancy at the moment. Anyway, there will be no way around writing one version with textures and another with good coalescing and seeing which one is better.