More threads/block increases kernel execution time. Why?

Hi All,

I’m currently running the simpleGL example from the SDK. I’ve been playing around with different block sizes and have encountered something that I hope someone can explain to me.

The original block size in the example is 8x8 = 64 threads. When I compile with the ptxas option I get 14 registers used per thread, and the occupancy calculator gives an occupancy of 50% on each multiprocessor, since we can only have 8 resident blocks on each multiprocessor at a given time, which gives us 512 active threads. So my idea was to set the threads/block to 128, since then we would still have 8 resident blocks per multiprocessor but 1024 active threads, and I thought this would decrease the execution time of the kernel. But no…

So my question is: does anyone know why this is the case? The block dim will be 16x8 when running 128 threads/block; can this affect the latency of memory writes in the kernel?

Additional information:

The mesh size is 256x256, and from the CUDA profiler I get an instruction throughput of 1.04115 for 8x8 threads/block
and 0.967362 for 16x8 threads/block.

My graphics card : Quadro FX 1800M

Thank you in advance!=)

As the thread count goes up, fewer registers and less shared memory are available per thread.

So perhaps that’s at play… though from the specs it would seem there are plenty of registers.

Try running a device query tool and post the precise specs here… plus perhaps some numbers about your kernel, like registers per thread, shared memory usage, and such…

Hi Skybuck!

First off, thank you for taking an interest in this =)

I was also thinking that it could be limited by the number of registers used per thread. I don’t use any dynamically allocated shared memory myself, and as I mentioned in my last post I use 14 registers/thread, so that should not be a bottleneck (as you also mentioned above): 1024 threads × 14 registers = 14336 (max 16384). I also only use 16 + 16 bytes of shared memory for the kernel.

My other thought was that the memory accesses would become uncoalesced if I changed the block size, but since I use float4 there should be no wasted memory transactions when writing to the texture…

Here are the specs for my board:

ptxas info: Used 14 registers, 56+0 bytes lmem, 16+16 bytes smem, 24 bytes cmem[0], 112 bytes cmem[1]

CUDA Driver Version : 4.0

CUDA Runtime Version: 3.0

CUDA Capability : 1.2

Number of multiprocessors: 9

Number of cores: 72

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total amount of registers available per block: 16384

Max threads per block: 512

So you increased the total number of threads from 8x8 to 256x256? If not, please provide some more details.

Perhaps it’s time for some code to look at ;)

Perhaps changing the thread layout might change the memory access pattern.

It’s probably best to have each 32-bit word accessed by a different thread for maximum memory bank usage. 1.2 is a bit different from 2.0, but it should still be good advice.

Each next 32 bits in memory is served by a different bank.

Whether this has something to do with your case is hard to say… perhaps not… but it could be something to look at… so some code might give further insight into whether anything changes ;)

Also, from what I understand from the manual, it seems best to have “odd” indexing.

So a thread should move its index from 1 to 3 to 5 to 7, if I am not mistaken… this would probably give maximum bank usage (so the least amount of bank conflicts).

Even if strides are big… (I think “stride” means the gap between memory addresses/indexes.)

So even a stride of 7 or 101 should still be better than a stride of 8 or 102 or something like that. Though 8 is still good for compute 2.0 (64-bit doubles and integers). But 8 is bad for compute 1.1 or so.

Hi!

I increased the number of threads/block to 8x16 (128 threads/block). Sorry if I’m being unclear =)

Bank conflicts, however, do not have anything to do with this, since I do not use any shared memory inside the kernel.

This depends on what you store in the shared memory and how you access the shared memory =)

Here is the code for the kernel:

__global__ void modify_vertex_pos_kernel(float4* pos, unsigned int width, unsigned int height, float time)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // calculate uv coordinates
    float u = x / (float) width;
    float v = y / (float) height;
    u = u*2.0f - 1.0f;
    v = v*2.0f - 1.0f;

    // calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u*freq + time) * cosf(v*freq + time) * 0.5f;

    // write output vertex
    pos[y*width+x] = make_float4(u, w, v, 1.0f);
}

My first concern was that when I changed the block dim from 8x8 to 8x16 I would get uncoalesced memory access, but I don’t think that is the case:

Example:

Block dim 8x8:

The first half warp consists of the following threads:

IDwarp = x + Dx*y

IDwarp = 0…15 gives x = 0…7 and y = 0…1

Since every thread writes a float4, each thread performs a 4 bytes × 4 = 16-byte memory transaction, so for a half warp this is 16 × 16 bytes = 256 bytes. This generates two 128-byte memory transfers per half warp with no unused memory addressing.

If the block dim is 8x16, the same condition should be satisfied. Do you agree?

I’m still kind of puzzled why a block dim of 8x8 is faster than a block dim of 8x16… We should have more active threads per multiprocessor with 128 threads per block (1024 active threads) than with 64 threads per block (512 active threads).

Any ideas? =)

I think you should try changing the dim to 8x8, or perhaps even 16x4, 32x2, or 64x1.

The mesh is 256x256, so if the dim is for example 64x1, the following happens:

256/64 = 4, 256/1 = 256

So there will be 4 horizontal blocks and 256 vertical blocks (maybe that doesn’t matter).

The number of threads per block will remain 64, with a total of 1024 blocks.

Perhaps the block ordering does matter; you could try swapping the x and y like so:

dim3 grid(mesh_width / block.y, mesh_height / block.x, 1);

This would then give
256 / 1 = 256 horizontal blocks, and 256 / 64 = 4 vertical blocks.

I think keeping things as horizontal as possible is best, because this gives better alignment and sequential speed.

See section F.4.2, Figure F-1; it seems to apply to compute 1.1 and 1.2 too (yours).

Thanks for clarifying that banks are only related to shared memory! ;) =D

Which document did you mean when you said “See F 4.2 figure F.1 it seems to apply to compute 1.1 and 1.2 too (yours)”.

I have tried these different block_dim:

block_dim (XxY)   kernel_execution_time

32 threads:

8x4   ~0.0949 ms
16x2  ~0.0951 ms
32x1  ~0.0957 ms

4x8   ~0.0950 ms
2x16  ~0.0954 ms
1x32  ~0.1424 ms *

64 threads:

8x8   ~0.0881 ms
16x4  ~0.0884 ms
32x2  ~0.0887 ms
64x1  ~0.0887 ms

8x8   ~0.0881 ms
4x16  ~0.0882 ms
2x32  ~0.0889 ms
1x64  ~0.1524 ms *

I guess the * times are a result of how we access the texture in the kernel (the alignment and sequential memory access that you mentioned). Since we have pos[y*width+x], we have to write the data to memory addresses that are widely spread for each half warp (pos[y*256 + 0], where y = 0…15 for the first half warp of block (0,0)).

Hmm, I don’t really know what else to check… This might be something specific to writing to textures in OpenGL… :argh: But I would really like to know what it is. I don’t want to miss something essential that will come back and haunt me.
Do you have any more ideas? :no:

So does this new data still support the original finding? Or is the thread title now obsolete?

Hi!

The 1xXX configurations have higher times because each access causes a separate memory transaction (uncoalesced).

The blocks of 64 threads get better times than the smaller blocks because, when a warp is waiting on a memory access, the warp scheduler can choose another warp to execute if one is ready. With blocks of 32 threads, the warp scheduler has to switch to another thread block whenever the block’s only warp is waiting on a memory access. An SM cannot have executing warps from more than one thread block.
If you increase the number of threads to 128 or even 256, you will probably get better times, because then the warp scheduler has more warps ready to choose from before changing thread blocks.

Regards!

Hi insvmvb00!

First off, thank you for responding to my post :smile:

Yes, I’m aware that 1xXX causes uncoalesced memory access (I mentioned it in one of the posts before). Since every thread writes 4 floats (16 bytes) to the texture, there is no wasted bandwidth if the block dim is x_thread = 8 or 16 and y_thread = 1…64 or 1…32 (the texture is addressed as pos[y*width + x] = make_float4(…)), since the memory request will be split into 4 requests, one per quarter warp (8 threads × 16 bytes = 128 bytes), which satisfies the condition for fully coalesced memory access.

This is what I thought as well. With 8x8 I only have 512 active threads per multiprocessor (8 resident blocks), but when I increased the block dim to 8x16 or 16x8, the kernel execution time became longer, despite the fact that I would now have 8 resident blocks per multiprocessor as before but 1024 active threads. (I would, however, not have chosen 256 threads, since that would mean only 4 resident blocks on every multiprocessor at any time.)

So the question is really: Why does the kernel execution time become longer when I have more active threads per multiprocessor as described above?

The NVIDIA people who made the example must have chosen a block dim of 8x8 for some reason. I just can’t understand why…
