I’m using CUDA SDK 4.0 currently and playing with the convolution separable example.
In the example, the column convolution kernel allocates a block of shared memory like this:
__shared__ float s_Data[COLUMNS_BLOCKDIM_X][(COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y + 1];
However, only COLUMNS_BLOCKDIM_X * (COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y elements of that shared memory are actually used.
But if I change the trailing + 1 to + 0, the program runs almost 8 times slower (from 4400 MPixels/sec down to 660 MPixels/sec). If I change the + 1 to + 2, it is about 20% slower (3300 MPixels/sec), and with + 3 it is almost as fast as with + 1.
I really don’t understand why these extra COLUMNS_BLOCKDIM_X floats of unused shared memory have such a big influence on the performance of the code.
I don’t think it is due to bank conflicts, since I’ve already set COLUMNS_BLOCKDIM_X to 32, which is the warp size.
I’m using a GTX 560 Ti, which is a Fermi-architecture card.
Does anyone have any idea about this magical constant?