More threads/block increases kernel execution time. Why?

Hi All,

I’m currently running the simpleGL example from the SDK. I’ve been playing around with different block sizes and have encountered something that I hope someone can explain to me.

The original block size in the example is 8x8 = 64 threads. When I compile with the ptxas option I get 14 registers used per thread, and the occupancy calculator gives an occupancy of 50% on each multiprocessor, since we can only have 8 resident blocks on each multiprocessor at a given time, which gives us 512 active threads. So my idea was to set the threads/block to 128, since then we would still have 8 resident blocks per multiprocessor but 1024 active threads, and I thought this would decrease the execution time of the kernel. But no…

So my question is: does anyone know why this is the case? The block dim will be 16x8 when running 128 threads/block; can this affect the latency of memory writes in the kernel?

Additional information:

The mesh size is 256x256, and from the CUDA profiler I get an instruction throughput of 1.04115 for 8x8 threads/block
and 0.967362 for 16x8 threads/block.

My graphics card : Quadro FX 1800M

Thank you in advance!=)

As the thread count goes up, fewer registers and less shared memory are available per thread.

So perhaps that’s at play… though from the specs it would seem there are plenty of registers.

Try running a device query tool and post the precise specs here… plus perhaps some numbers about your kernel, like registers per thread, shared memory usage, and such…

Hi Skybuck!

First off, thank you for taking an interest in this =)

I was also thinking that it could be limited by the number of registers used per thread. I don’t use any dynamically allocated shared memory myself, and as I mentioned in my last post I use 14 registers/thread, so that should not be a bottleneck (as you also mentioned above): 1024 threads × 14 registers = 14336 (max 16384). I also only use 16 + 16 bytes of shared memory for the kernel.

My other thought was that the memory accesses would become uncoalesced if I changed the block size, but since I use float4 there should be no wasted memory transactions when writing to the texture…

Here are the specs for my board:

ptxas info: Used 14 registers, 56+0 bytes lmem, 16+16 bytes smem, 24 bytes cmem[0], 112 bytes cmem[1]

CUDA Driver Version : 4.0

CUDA Runtime Version: 3.0

CUDA Capability : 1.2

Number of multiprocessors: 9

Number of cores: 72

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total amount of registers available per block: 16384

Max threads per block: 512

So you increased the total number of threads from 8x8 to 256x256? If not, please provide some more details.

Perhaps it’s time for some code to look at ;)

Perhaps changing the thread layout might change the memory access pattern.

It’s probably best to have each 32-bit word accessed by a different thread for maximum memory bank usage. 1.2 is a bit different from 2.0, but it should still be good advice.

Each next 32 bits in memory is served by a different bank.

Whether this has something to do with your case is hard to say… perhaps not… but it could be something to look at… so some code might give further insight into whether anything changes ;)

Also, from what I understand from the manual, it seems best to have “odd” indexing.

So a thread should move its index from 1 to 3 to 5 to 7, if I am not mistaken… this would probably give maximum bank usage (so the least amount of bank conflicts).

Even if strides are big… (I think “stride” means the gap between memory addresses/indexes.)

So even a stride of 7 or 101 should still be better than a stride of 8 or 102 or something like that. Though 8 is still good for compute 2.0 (64-bit doubles and integers). But 8 is bad for compute 1.1 or so.

Hi!

I increased the number of threads/block to 8x16 (128 threads/block). Sorry if I’m being unclear =)

Bank conflicts, however, do not have anything to do with this, since I do not use any shared memory inside the kernel.

This depends on what you store in the shared memory and how you access the shared memory =)

Here is the code for the kernel:

__global__ void modify_vertex_pos_kernel(float4* pos, unsigned int width, unsigned int height, float time)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    // calculate uv coordinates
    float u = x / (float) width;
    float v = y / (float) height;
    u = u*2.0f - 1.0f;
    v = v*2.0f - 1.0f;

    // calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u*freq + time) * cosf(v*freq + time) * 0.5f;

    // write output vertex
    pos[y*width+x] = make_float4(u, w, v, 1.0f);
}

My first concern was that when I changed the block dim from 8x8 to 8x16 I would get uncoalesced memory access, but I don’t think that is the case:

Example:

Block dim 8x8:

The first half warp consists of the following threads:

IDwarp = x + Dx*y

IDwarp = 0…15 gives x = 0…7 and y = 0…1

Since every thread writes a float4, each thread performs a 4 bytes × 4 = 16-byte memory transaction, so for a half warp this is 16 × 16 bytes = 256 bytes. This generates two 128-byte memory transfers per half warp with no unused memory addressing.

If the block dim is 8x16, the same condition should be satisfied. Do you agree?

I’m still kind of puzzled why a block dim of 8x8 is faster than a block dim of 8x16… We should have more active threads per multiprocessor with 128 threads per block (1024 active threads) than with 64 threads per block (512 active threads).

Any ideas? =)

I think you should try changing the dim to 8x8, or perhaps even 16x4, 32x2, or 64x1.

The mesh is 256x256, so if the dim is for example 64x1, the following happens:

256/64 = 4, 256/1 = 256

So there will be 4 horizontal blocks and 256 vertical blocks (maybe that doesn’t matter).

The number of threads per block will remain 64, with a total of 1024 blocks.

Perhaps the block ordering does matter; you could try swapping the x and y like so:

dim3 grid(mesh_width / block.y, mesh_height / block.x, 1);

This would then give
256 / 1 = 256 horizontal blocks, and 256 / 64 = 4 vertical blocks.

I think keeping things as horizontal as possible is best, because this gives better alignment and sequential speed.

See section F.4.2, Figure F-1; it seems to apply to compute 1.1 and 1.2 too (yours).

Thanks for clarifying that banks are only related to shared memory! ;) =D

Which document did you mean when you said “See F 4.2 figure F.1 it seems to apply to compute 1.1 and 1.2 too (yours)”.

I have tried these different block_dim:

block_dim (XxY)   kernel_execution_time

32 threads:

8x4   ~0.0949 ms
16x2  ~0.0951 ms
32x1  ~0.0957 ms

4x8   ~0.0950 ms
2x16  ~0.0954 ms
1x32  ~0.1424 ms *

64 threads:

8x8   ~0.0881 ms
16x4  ~0.0884 ms
32x2  ~0.0887 ms
64x1  ~0.0887 ms

8x8   ~0.0881 ms
4x16  ~0.0882 ms
2x32  ~0.0889 ms
1x64  ~0.1524 ms *

I guess the * times are a result of how we access the texture in the kernel (the alignment and sequential memory access that you mentioned). Since we have pos[y*width+x], we have to write the data to memory addresses that are widely spread for each half warp (pos[y*256 + 0], where y = 0…15 for the first half warp of block (0,0)).

Hmm, I don’t really know what else to check… This might be something specific to writing to textures in OpenGL… :argh: But I would really like to know what it is. I don’t want to miss something essential that will come back and haunt me.
Do you have any more ideas? :no:

So does this new data still support the original finding? Or is the thread title now obsolete?

Hi!

The 1xXX configurations have higher times because each access causes a separate memory transaction (uncoalesced).

The blocks of 64 threads get better times than the smaller blocks because, when a warp is waiting on a memory access, the warp scheduler can choose another warp to execute if one is ready. With blocks of 32 threads, the warp scheduler has to switch to another thread block whenever the block’s only warp is waiting on a memory access. An SM cannot have executing warps from more than one thread block.
If you increase the number of threads to 128 or even 256, you will probably get better times, because then the warp scheduler has more warps ready to choose from before changing thread blocks.

Regards!

Hi insvmvb00!

First off, thank you for responding to my post :smile:

Yes, I’m aware that 1xXX causes uncoalesced memory access (I mentioned it in one of the posts before). Since every thread writes 4 floats (16 bytes) to the texture, there is no wasted bandwidth if the block dim is x_thread = 8 or 16 and y_thread = 1…64 or 1…32 (the texture is addressed as pos[y*width + x] = make_float4(…)), since the memory request will be split into 4 requests, one per quarter warp (8 threads × 16 bytes = 128 bytes), which satisfies the condition for fully coalesced memory access.

This is what I thought as well. With 8x8 I only have 512 active threads per multiprocessor (8 resident blocks), but when I increased the block dim to 8x16 or 16x8, the kernel execution time became longer, despite the fact that I would now have 8 resident blocks per multiprocessor as before but 1024 active threads. (I would, however, not have chosen 256 threads, since that would mean only 4 resident blocks on every multiprocessor at any time.)

So the question is really: Why does the kernel execution time become longer when I have more active threads per multiprocessor as described above?

The NVIDIA people who made the example must have chosen a block dim of 8x8 for some reason. I just can’t understand why…
