More threads/block increase kernel execution time. WHY?

Hi tera!

The topic is still valid. =)

The thread layout/organization/dimensionality probably determines how the threads can be distributed across “sub cores” within “a core”.

Each GPU seems to be a bit different, so how the setup affects your GPU is GPU-specific.

A block can only run on a single “core” as far as I know.

But the block itself can further be split up among “sub cores”.

This probably requires the block itself to be split up into multiple dimensions.

So that's what you/one does when specifying a block of 8x8 or 2x2x2 (something you could also try, though a core is probably divided into 2D and not 3D, so try it anyway just for the fun of it).

I suspect it works something like this: x,y,z (x = threads per sub core, y = sub core) (so x = warp size, y = sub core)
So 8x8 gives 8 threads per sub core, 8 sub cores used.
So 16x4 gives 16 threads per sub core, 4 sub cores used
So 32x2 gives 32 threads per sub core, 2 sub cores used
So 64x1 gives 64 threads per sub core, 1 sub core used.

^ This explains why 64x1 is probably bad: a sub core can only run 32 threads in parallel. (So it needs 2 runs x 32)

Thus its time doubles!

It probably doesn't matter whether 8x8 or 16x4 or 32x2 is chosen, because the X number fits in a warp size and the Y number fits in the amount of sub cores. So it's pretty much equivalent (this is a good sign; it means the hardware is pretty efficient no matter what the setup is, as long as it can all run in parallel).

Let’s see what happens the other way around:

I am not sure but perhaps this would be converted to hardware as follows:

So 8x8 gives 8 threads per sub core, 8 sub cores used.
So 4x16 gives 4 threads per sub core, 16 sub cores used.
So 2x32 gives 2 threads per sub core, 32 sub cores used.
So 1x64 gives 1 thread per sub core, 64 sub cores used.

I am not exactly sure what the specs of your GPU mean:

It was 9 multiprocessors and 72 cores. 72/9 = 8.

So perhaps it has 9 cores with 8 sub cores each.

If the second table is correct then only the first entry would execute fast, since your GPU only has 9 cores and 8 sub cores, so it still fits.

The second entry would be 4 cores and 2x8 sub cores, so this would probably double execution time.

The third entry would be 2 cores and 4x8 sub cores, so this would probably double execution time again.

The fourth entry would be 1 core and 8x8 sub cores, so this would probably double execution time again.

But this is not what seems to happen, since your times didn’t double.

So I suspect the hardware simply swaps the dimensions around so:

1x64 becomes: 64x1

Which explains why the time is the same as the first table.

My explanation above is probably not correct. I already had a hunch how it probably works… but now I think I fully understand.

What probably happens is the following:

A thread block is converted to a 1D array of threads.

The warp scheduler tries to schedule these threads over the CUDA cores.

A CUDA core can only execute 1 thread.

The warp scheduler tries to group the CUDA cores together into a warp; a warp is probably a memory-efficiency and instruction-cache-sharing technique.

Anyway the moral of the story is:

The dimensions of the thread block probably shouldn't matter, since they all get distributed by the warp scheduler in the same way.

However, there is still a measured difference which is not yet explained.

Maybe this has something to do with the grid.

It’s a bit strange.

Your GPU probably has 8*9 = 72 CUDA cores.

So it can execute 72 threads in parallel at most.

Perhaps this info will help to figure out what is causing the difference.
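
As a side note (my own illustration, not something posted in this thread): the multiprocessor count can be read directly with the CUDA runtime API; the cores-per-SM figure is not reported by the API, so the 8 per SM for compute capability 1.x is hard-coded below as an assumption.

// Sketch: confirm the multiprocessor count with the runtime API.
// (8 cores per SM is assumed for compute capability 1.x parts.)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                       // device 0
    printf("multiprocessors : %d\n", prop.multiProcessorCount);
    printf("compute cap.    : %d.%d\n", prop.major, prop.minor);
    printf("CUDA cores      : %d (assuming 8 per SM on CC 1.x)\n",
           prop.multiProcessorCount * 8);
    return 0;
}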

Hi Skybuck!

Yes, your first post was not correct. As you said in your second one, my GPU has 9 multiprocessors which each have 8 cores.

Yes it is.

The dimensions of the thread block do however affect the following: (some of which you and I have already mentioned)

  • Total number of blocks to be distributed on all available multiprocessors.

  • Total number of resident blocks per multiprocessor (total number of active warps per multiprocessor)

  • How many memory transactions the warp scheduler needs to perform to execute one warp (this depends on how you read/write the elements from memory and what you read/write)

The last one is the one that causes the time difference between the executions I mentioned before:

Example:

The warp scheduler always groups threads inside the same block together according to IDw = (0…31) (32…63) and so on… where IDw is:

(1) 1D: IDw = threadXid;

(2) 2D: IDw = threadXid + threadYid*Dx (where Dx is the block size in the x-direction)

(3) 3D: IDw = threadXid + threadYid*Dx + threadZid*Dx*Dy
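
To make the grouping concrete, here is a small device-side sketch of mine (not code from this thread) that computes the in-block linear ID from formulas (1)-(3); warp 0 is then IDw 0…31, warp 1 is IDw 32…63, and so on.

// Sketch: the in-block linear thread ID used to form warps,
// i.e. formulas (1)-(3) above written with the CUDA built-ins.
__device__ unsigned int warpLinearId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x                 // 2D term
         + threadIdx.z * blockDim.x * blockDim.y;   // 3D term
    // warp number = warpLinearId() / 32, lane = warpLinearId() % 32
}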

The following is for compute capability 1.2-1.3.

If we have a block size of 64x1 (x_threads = 64 and y_threads = 1), let's see how the warp scheduler should organize the threads for the following kernel:

unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

// calculate uv coordinates
float u = x / (float) width;
float v = y / (float) height;
u = u*2.0f - 1.0f;
v = v*2.0f - 1.0f;

// calculate simple sine wave pattern
float freq = 4.0f;
float w = sinf(u*freq + time) * cosf(v*freq + time) * 0.5f;

// write output vertex
pos[y*width+x] = make_float4(u, w, v, 1.0f);

Since every thread in the above kernel writes 16 bytes, the warp scheduler will try to split this into 4 x 128-byte memory requests, one for each quarter warp (32 threads x 16 bytes = 512 bytes = 4 x 128 bytes).

If we just focus on the first block (blockIdx.x = 0, blockIdx.y = 0) and look at the first quarter warp, IDw = 0…7, we would have x = 0…7 and y = 0 according to (2). If we then look at how the global memory should be written for the first quarter warp we would have:

pos[0] = make_float4(..)  16 bytes
pos[1] = make_float4(..)  16 bytes
pos[2] = make_float4(..)  16 bytes
...
pos[7] = make_float4(..)  16 bytes

This will be fit into one 128-byte memory request (16 bytes * 8) by the warp scheduler, since we access the global memory in sequence; hence we will have coalesced memory access.

If we on the other hand have y_threads = 64 and x_threads = 1, we would have x = 0 and y = 0…7 according to (2). If we then look at how the global memory should be written for the first quarter warp we would have:

pos[256*0] = make_float4(..)
pos[256*1] = make_float4(..)
pos[256*2] = make_float4(..)
...
pos[256*7] = make_float4(..)

Here on the other hand the warp scheduler will have to issue 8 x 128-byte memory requests, BUT since only the lower half of each request is used it will reduce them to 8 x 32-byte memory requests, and of every 32-byte transaction we only “use” 16 bytes. So we have “wasted” 50% of every memory transaction, since we request twice the data we actually use, and the kernel execution time should be around double… (as the measured times in one of my previous posts)

If we instead have y_threads = 32 and x_threads = 2, we would have x = 0…1 and y = 0…3 according to (2). If we then look at how the global memory should be written for the first quarter warp we would have:

pos[0 + 256*0] = make_float4(..)
pos[1 + 256*0] = make_float4(..)
pos[0 + 256*1] = make_float4(..)
pos[1 + 256*1] = make_float4(..)
...
pos[0 + 256*3] = make_float4(..)
pos[1 + 256*3] = make_float4(..)

Here the warp scheduler will have to issue 4 x 128-byte memory requests, BUT since only the lower half of each request is used it will reduce them to 4 x 32-byte memory requests, and in every memory transaction we use all 32 bytes, so none of the memory transactions are wasted. This would give almost the same time as if we only issued one 128-byte memory transaction, but since issuing one memory transaction of 128 bytes is faster than issuing 4 x 32 bytes, the kernel execution time should be a bit higher (as the measured time in one of my previous posts)
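
To make this concrete, here is a small host-side sketch (mine, not from the thread) that enumerates the pos[] elements written by the first quarter warp of block (0,0) for a few block shapes, using mapping (2) and assuming width = 256 as in the examples above. It reproduces the patterns discussed: 64x1 and 8x8 touch 8 consecutive elements (one 128-byte segment), 2x32 touches 4 rows (4 x 32 bytes), and 1x64 touches 8 rows (8 x 32 bytes).

// Sketch: which pos[] elements the first quarter warp (IDw = 0..7) of
// block (0,0) writes, for a given block shape (x_threads, y_threads).
// Assumes a row width of 256 float4 elements, as in the example above.
#include <cstdio>

int main()
{
    const int width = 256;
    const int shapes[][2] = { {64, 1}, {8, 8}, {2, 32}, {1, 64} };  // x_threads, y_threads

    for (const auto &s : shapes)
    {
        const int Dx = s[0], Dy = s[1];
        printf("%2dx%-2d :", Dx, Dy);
        for (int IDw = 0; IDw < 8; ++IDw)         // first quarter warp
        {
            const int tx = IDw % Dx;              // threadIdx.x
            const int ty = IDw / Dx;              // threadIdx.y
            const int x  = 0 * Dx + tx;           // blockIdx.x = 0
            const int y  = 0 * Dy + ty;           // blockIdx.y = 0
            printf(" pos[%d]", y * width + x);
        }
        printf("\n");
    }
    return 0;
}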

So far every time I have measured makes sense, HOWEVER I'm still confused why a block dim of 8x16 isn't running faster… We should have more active threads per multiprocessor and so on, as I mentioned in my last post…

Does anyone have any idea? (or questions regarding the above…)

But the latest data you presented shows just that: 8x16 is some 7% faster than 8x8 threads. So there is a slight advantage, but nowhere near the factor of two which could be expected (only) if the smaller block size wasted half of the bandwidth.

Hi!

Ok, sorry, I did not understand your question.

One possible explanation is that, as you are using 8 x 128 x 14 = 14336 registers per SM (almost reaching the limit, which is 16 K registers for CC 1.2 and 1.3), the compiler perhaps spills registers to local memory. If that occurs, it is a very slow operation. You can check it by viewing your PTX code and looking for any ‘st.local’ or ‘ld.local’ in the code.
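
Besides reading the PTX, a complementary check (my suggestion, not part of the original post) is to ask the runtime for the kernel's register and local-memory usage with cudaFuncGetAttributes; a non-zero localSizeBytes usually indicates spilling. Compiling with --ptxas-options=-v prints the same numbers at build time. The kernel below is just a placeholder for the real one.

// Sketch: report register and local-memory usage of a kernel at run time.
// 'myKernel' is a placeholder name, not the kernel from this thread.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float4 *pos, unsigned int width,
                         unsigned int height, float time) { /* ... */ }

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, (const void *)myKernel);
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local mem per thread : %zu bytes (non-zero suggests spilling)\n",
           attr.localSizeBytes);
    return 0;
}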

Good luck!

Hi tera!

I don't think I have presented any time measurement for 8x16 (couldn't find it anywhere in my previous posts anyway). Of course I should have, since it's an essential part :smile:

Which data did you mean regarding the 7% faster?

The kernel execution time for running an 8x16 block is about ~0.0939 ms, so it runs slower than a block of 8x8.

Oh sorry, my bad. I was referring to the 64 vs 32 threads measurements, which are compatible with your analysis.

Maybe if you can extend those measurements up to 8x16, then we can get a hint on what's going on.

To me your indexing code looked suspicious from the first time I saw it; it's probably not producing sequential access.

unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

Your other code depends on an X and on Y.

Plus you want sequential access for your array, it seems, and simply 2D storage too:

pos[y*width+x]

What you should try is to give Skybuck's General Indexing formulas a try. I am not going to repeat the first part because it's quite long.

But the first part allows the 6 indexes and the 6 dimension sizes to be converted properly to a 1D index.

Once the 1D index is calculated it can be converted to a 2D index.

I shall give the formulas for that last one, just as a verification technique:

Y = (LinearIndex / Width);
X = (LinearIndex - Y * Width);

Let's try and see if your formulas follow these rules by calculating an arbitrarily chosen situation:
(Since your problem is only 4D it's not so bad and doesn't take so long to work out… so this is the 4D version (2Dx2D):)

The timing difference is for 1x64 threads which would lead to 256/1 and 256/64 blocks (256 x 4 blocks).

ThreadIdx.x = 1
ThreadIdx.y = 61;

BlockIdx.x = 253;
BlockIdx.y = 3;

BlockDim.x = 1;
BlockDim.y = 64;

unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

Calculating x based on your formulas: 253 * 1 + 1 = 254;
Calculating y based on your formulas: 3 * 64 + 61 = 253;

Now I am going to calculate what I think it should be based on my formulas:

ThreadX = ThreadIdx.x;
ThreadY = ThreadIdx.y;
ThreadWidth = BlockDim.X;
ThreadHeight = BlockDim.Y;
ThreadIndex = ThreadY * ThreadWidth + ThreadX;
ThreadVolume = ThreadWidth * ThreadHeight;

BlockX = BlockIdx.X;
BlockY = BlockIdx.Y;
BlockWidth = GridDim.x;

BlockIndex = BlockY * BlockWidth + BlockX;

FinalIndex = BlockIndex * ThreadVolume + ThreadIndex;

So let’s calculate:

ThreadIdx.x = 1
ThreadIdx.y = 61;

BlockIdx.x = 253;
BlockIdx.y = 3;

BlockDim.x = 1;
BlockDim.y = 64;

GridDim.x = 256;

ThreadX = ThreadIdx.x(1);
ThreadY = ThreadIdx.y(61);
ThreadWidth = BlockDim.X(1);
ThreadHeight = BlockDim.Y(64);
ThreadIndex = ThreadY(61) * ThreadWidth(1) + ThreadX(1); // = 62
ThreadVolume = ThreadWidth(1) * ThreadHeight(64); // = 64

BlockX = BlockIdx.X(253);
BlockY = BlockIdx.Y(3);
BlockWidth = GridDim.x(256);

BlockIndex = BlockY(3) * BlockWidth(256) + BlockX(253); // 1021

FinalIndex = BlockIndex(1021) * ThreadVolume(64) + ThreadIndex(62); // = 65406;

Now that the final linear index has been calculated it can be converted to 2D:

Y = (LinearIndex(65406) / Width(256)); // = 255
X = (LinearIndex(65406) - Y(255) * Width(256)); // = 126

Comparing these two (x,y) pairs, the following conclusion can be made: your indexing formulas are incorrect and are not leading to sequential access.

Advice: try replacing them with my formulas.

Then do some new measuring tests and see how it works out.
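
For reference, this is roughly what the kernel would look like with that indexing. This is my own sketch: the kernel name and parameter list are guessed from the code quoted earlier in the thread, and it assumes the grid launches exactly width*height threads in total.

// Sketch: the wave kernel rewritten with the linear indexing above.
__global__ void waveKernelLinear(float4 *pos, unsigned int width,
                                 unsigned int height, float time)
{
    // Flatten block and thread coordinates into one linear index, so
    // consecutive threads write consecutive pos[] elements regardless
    // of the block shape.
    unsigned int threadIndex  = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int threadVolume = blockDim.x * blockDim.y;
    unsigned int blockIndex   = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int linearIndex  = blockIndex * threadVolume + threadIndex;

    // Convert the linear index back to 2D (the verification formulas above).
    unsigned int y = linearIndex / width;
    unsigned int x = linearIndex - y * width;

    // Same computation as the original kernel.
    float u = x / (float) width;
    float v = y / (float) height;
    u = u*2.0f - 1.0f;
    v = v*2.0f - 1.0f;

    float freq = 4.0f;
    float w = sinf(u*freq + time) * cosf(v*freq + time) * 0.5f;

    pos[linearIndex] = make_float4(u, w, v, 1.0f);   // == pos[y*width + x]
}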

Hi SkyBuck!

Thanks for your reply but I think you have misunderstood me a tiny bit =) (or I'm being unclear once again :yes: )
You are of course right when you say that the memory access for 1x64 is not sequential in my code (that is what I was trying to show in my last post). The memory coalescing is not optimal until I use at least x_threads = 8.

If we were bound to use 1x64 threads then your indexing would be optimal, given the difference in memory accesses per quarter warp. (Time with your code ~0.102 ms compared to ~0.140 ms)
BUT for any other configuration your indexing will be non-optimal. Why?

  1. Your indexing adds more registers/computations per thread (slows it down)
  2. Since I'm writing 16 bytes per thread, the warp scheduler will try to fit this (as I mentioned before) into one 128-byte memory instruction for every quarter warp (8 threads). If it manages to do that then it's as optimal as it gets regarding coalesced memory access using 16 bytes per thread for compute capability 1.2-1.x (I think… in 2.x the way the warps are scheduled changes)

Since your indexing is always sequential, meaning that you always write sequentially regardless of the block_dim, the warp scheduler will always fit the memory transactions for each quarter warp into one 128-byte instruction, and the kernel execution time will stay around 0.102 ms as long as we test with a total of 64 threads per block.

For x_threads = 2
In my indexing:
this will result in 4*32-byte memory instructions from the warp scheduler (my kernel execution time is however faster, due to the extra instructions/registers you use per thread).

In your code:
this will result in 1*128 bytes memory instruction from the warp scheduler

For x_threads = 4
In my indexing:
this will result in 2x64 bytes memory instruction from the warp scheduler.

In your code:
this will result in 1*128 bytes memory instruction from the warp scheduler.

For x_threads = 8
In my indexing:
This will result in 1x128 bytes memory instruction from the warp scheduler.

In your code:
this will result in 1*128 bytes memory instruction from the warp scheduler.

Here, when the warp scheduler uses one 128-byte memory instruction, you can see the biggest gap between the two implementations:
My: ~0.088 ms
Yours: ~0.102 ms

I will check and see if the registers are spilled to local memory as insmvb00 suggested.
