The magic 1 in the SDK convolution separable example

Hi guys

I’m currently using CUDA SDK 4.0 and experimenting with the convolutionSeparable example.

The example contains a column convolution kernel, and this kernel allocates a piece of shared memory.

The declaration is the following:

__shared__ float s_Data[COLUMNS_BLOCKDIM_X][(COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y + 1];

However, only COLUMNS_BLOCKDIM_X * (COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y elements of this shared memory are actually used.

But if I change the trailing + 1 to + 0, the program becomes roughly 7x slower (from 4400 MPixels/sec down to 660 MPixels/sec). If I change + 1 to + 2, it is about 25% slower (3300 MPixels/sec), and if I change it to + 3, performance is almost the same as with + 1.

I really don’t understand why these extra COLUMNS_BLOCKDIM_X unused shared-memory elements have such a big influence on the performance of the code.

I don’t think it is due to bank conflicts, since I have already set COLUMNS_BLOCKDIM_X to 32, which is the warp size.

I’m using a GTX 560 Ti, which is a Fermi-architecture GPU.

Does anyone have any idea about this magical constant?

Thanks

Tao Zhou

This dimension is padded by 1 to avoid shared memory bank conflicts. Bank conflicts can be a serious drag on performance, as you noticed when you removed the (not-so-magic) 1.
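
To make the effect concrete, here is a minimal sketch (not the SDK kernel; the kernel names, TILE, and launch configuration are made up for illustration). It assumes the Fermi layout of 32 shared-memory banks of 4-byte words: when the lanes of a warp differ only in threadIdx.x and walk along a fixed column index of a 2D shared array, their addresses are exactly one row stride apart, so an odd stride spreads them over all 32 banks while an even stride sends several lanes to the same bank and serializes the access.

// Minimal sketch, not the SDK kernel: the kernel names, TILE, and the launch
// configuration below are made up purely for illustration. It assumes the
// Fermi layout of 32 shared-memory banks of 4-byte words.
#include <cuda_runtime.h>
#include <cstdio>

#define TILE 32

// Unpadded tile: the row stride is 32 floats, so element [x][c] lives at word
// address x * 32 + c, i.e. in bank (x * 32 + c) % 32 == c % 32. For a warp
// whose lanes differ only in threadIdx.x, all 32 lanes therefore hit the SAME
// bank -> a 32-way conflict, and the access is serialized.
__global__ void columnSumUnpadded(float *out)
{
    __shared__ float tile[TILE][TILE];

    for (int c = threadIdx.y; c < TILE; c += blockDim.y)
        tile[threadIdx.x][c] = (float)(threadIdx.x + c);
    __syncthreads();

    float sum = 0.0f;
    for (int c = 0; c < TILE; ++c)
        sum += tile[threadIdx.x][c];      // every lane reads bank c % 32
    out[threadIdx.y * TILE + threadIdx.x] = sum;
}

// Padded tile: the row stride is 33 floats, so the bank of element [x][c] is
// (x * 33 + c) % 32 == (x + c) % 32, which is different for every lane of the
// warp -> conflict-free, even though the extra column is never used.
__global__ void columnSumPadded(float *out)
{
    __shared__ float tile[TILE][TILE + 1];   // the "+ 1" padding

    for (int c = threadIdx.y; c < TILE; c += blockDim.y)
        tile[threadIdx.x][c] = (float)(threadIdx.x + c);
    __syncthreads();

    float sum = 0.0f;
    for (int c = 0; c < TILE; ++c)
        sum += tile[threadIdx.x][c];      // lanes spread over all 32 banks
    out[threadIdx.y * TILE + threadIdx.x] = sum;
}

int main()
{
    const dim3 block(TILE, 8);               // lanes of a warp differ in x
    float *d_out = 0;
    cudaMalloc((void **)&d_out, TILE * 8 * sizeof(float));

    columnSumUnpadded<<<1, block>>>(d_out);  // slow: 32-way bank conflicts
    columnSumPadded<<<1, block>>>(d_out);    // fast: conflict-free
    cudaDeviceSynchronize();

    printf("kernels finished: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_out);
    return 0;
}

In general, gcd(row stride in floats, 32) lanes of each warp land in the same bank for this access pattern. In the SDK kernel the row stride is (COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y plus the padding; with an even COLUMNS_BLOCKDIM_Y that base stride is even, so + 0 gives heavy multi-way conflicts, + 2 still leaves 2-way conflicts, and the odd strides produced by + 1 or + 3 are conflict-free, which lines up with the roughly 7x, 25%, and "same as + 1" results you measured. Setting COLUMNS_BLOCKDIM_X to 32 does not help, because the bank a lane hits is determined by the stride of the second dimension, not by the size of the first.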

Hi njuffa

Thank you very much for your fast reply!

Tao