3 x 16 thread block runs faster than 16 x 3 why is that?

Hi all,

I have a kernel that uses a fixed number of blocks in a grid and a fixed number of threads in a block. The number of threads is 48, so I tried running the kernel with a 3 x 16 block and also with a 16 x 3 block. The strange thing is that the profiler shows 25% occupancy in both cases, yet the 16 x 3 version runs about 50% slower.

It is especially strange because Mark said in another post

Which is exactly the opposite of what I’m seeing, so I’m kind of confused about what’s going on. Any ideas?

(Both cubin files show reg = 18 and smem = 4256.)


It’s possible that the 3x16 arrangement is better in your particular case because of a different memory access pattern. Maybe it avoids bank conflicts where 16x3 has many? There may also be related issues with coalescing when accessing global memory.


Thanks Paulius, the bank conflict issue is the most probable explanation, I’ll look into that.

You mention memory coalescing — what exactly does this mean and how does it affect performance? Sorry if I’m asking trivial things here, but the Wikipedia entry “Coalescence (computer science)” is empty :(

This refers to the ability of the hardware to read/write more than one data word at once. That is, the G80 can transfer 128 bits in a single call, so one call can service an entire struct or, for example, 4 threads with 32 bits each. See the relevant section in the manual.


Thanks Peter! It seems that, as long as I only need to read memory from the kernel and not write it, it’s much better to go with texture fetches from array memory; then there’s no need to worry about this coalescence and alignment business. Anyway, I’m starting to understand why my CUDA implementations are still 4-5 times slower than the OpenGL/Cg versions.


Along the same lines (3x16 or 16x3): I have a situation where I need to access a 3D array in all three directions (x varies fastest, z slowest). The bulk of the time is spent accessing data in the z direction. With my block size of 128, can I use, say, a 4x32 or a 32x4 block to get 4 z values that are next to each other? Which should I use? I would think 4x32 would be better in this case.

My approach is just to try all possible combinations and pick the fastest. Sometimes the result is kind of unexpected, at least to me :) For example, the following table shows execution times (in seconds, running the kernel 200 times) of the same kernel with different numbers of threads per block:

32 x 3 time 21.4
24 x 3 time 14.7
16 x 3 time 11.6
8 x 3 time 7.6
4 x 3 time 4.4
2 x 3 time 5.56
1 x 3 time 8.78

3 x 32 time 7.63
3 x 24 time 7.02
3 x 16 time 5.98
3 x 8 time 5.08
3 x 4 time 4.34
3 x 2 time 5.54
3 x 1 time 8.79

Yes, this is also my experience: you cannot get around extensive testing.

BTW, using textures needs more registers. Look at the .ptx file: every texture fetch has to set up quite a bit of state before it calls the texture unit. The extra register usage can mean lower occupancy than the lean device memory fetches allow. So in cases where coalescing works properly, I have seen reading from device memory perform as fast as texture fetches.


Thanks for pointing that out. I already have a problem with too many registers per block; this is the main factor limiting occupancy at the moment. Anyway, there will be no way around writing one version with textures and another with good coalescing and seeing which one is better.