How warp serialization works on shared memory How to run a "data[n] += something" efficientl

The first way is how I do the convolution…(I guess that you mean s_Volume and not s_Image), will try to flip x and y but does it really make any difference if I use a thread configuration that is

dimBlock = dim3(8,8,7) ?

Does it matter if I store the data like

shared float s_Volume[16][15][16]

instead of

shared float s_Volume[16][16][15]

?

Hey Wandrine

3D stuff is usually complicated to grasp (at least for me) on just a whim. I would rewrite your code to use a linear reference counter s_Image[threadIdx.z + y_offset + x_offset] or something similar. That way you can easily check whether each thread in a warp accesses memory in an orderly fashion. The other way s_Image[column][row][depth] also makes it a bit tricky due to how the compiler reads it.

Hey Wandrine

3D stuff is usually complicated to grasp (at least for me) on just a whim. I would rewrite your code to use a linear reference counter s_Image[threadIdx.z + y_offset + x_offset] or something similar. That way you can easily check whether each thread in a warp accesses memory in an orderly fashion. The other way s_Image[column][row][depth] also makes it a bit tricky due to how the compiler reads it.

Yeah 3D is a bit tricky, but I got a 2x speedup by switching the x and y indices…

Yeah 3D is a bit tricky, but I got a 2x speedup by switching the x and y indices…

Avoiding bank conflicts means that the lowest four bits of the thread index need to map uniquely to bits 2…6 of the shared memory address. With dimBlock.x=8, the lowest four bits consist of threadIdx.x and the lowest bit of threadIdx.y. I we neglect threadIdx.y, we get at most two-way bank conflicts, which is not too bad and still a lot better than 16-way.

The obvious way to avoid higher that 2-way bank conflicts is to index the last index of s_Volume with x: [font=“Courier New”]s_Volume[…][…][threadIdx.x + x_offset][/font].

As you have defined s_Volume as [font=“Courier New”]shared float s_Volume[16][16][15][/font], you can however also achieve this with [font=“Courier New”]s_Volume[…][threadIdx.x + x_offset][…][/font]. With a declaration of [font=“Courier New”]shared float s_Volume[16][15][16][/font] this would lead to 8- or 16-way bank conflicts.

Avoiding bank conflicts means that the lowest four bits of the thread index need to map uniquely to bits 2…6 of the shared memory address. With dimBlock.x=8, the lowest four bits consist of threadIdx.x and the lowest bit of threadIdx.y. I we neglect threadIdx.y, we get at most two-way bank conflicts, which is not too bad and still a lot better than 16-way.

The obvious way to avoid higher that 2-way bank conflicts is to index the last index of s_Volume with x: [font=“Courier New”]s_Volume[…][…][threadIdx.x + x_offset][/font].

As you have defined s_Volume as [font=“Courier New”]shared float s_Volume[16][16][15][/font], you can however also achieve this with [font=“Courier New”]s_Volume[…][threadIdx.x + x_offset][…][/font]. With a declaration of [font=“Courier New”]shared float s_Volume[16][15][16][/font] this would lead to 8- or 16-way bank conflicts.