As you noticed, the maximum size of each dimension of a block on Fermi is 1024 x 1024 x 64, but the maximum number of threads per block is 1024.
A block of size 320x320 exceeds the maximum number of threads per block. You can try other configurations such as 32x32, 64x16, or 1024x1 and compare the results and performance.
I tried a 1024x1 block; the performance is the same (24 ms).
But 32x32 and 64x16 fail with the error “e:/GPU/GpuLab/gpu/Filter2D/filter_kernel.cu(59) : cudaSafeCall() Runtime API error : unknown error.”
Compute Visual Profiler gives me this report with a 64x16 block. I am wondering, can I get better performance? My real-time video processing application needs the algorithm to finish in 5 ms!
The Visual Profiler report shows a block size of 20x20x1, not 32x32x1 or 64x16x1.
How you arrange threads within a block depends on how the data is going to be accessed. On the other hand, you are not using shared memory, which could give you better performance.
Your input data (4000x3000) is not a multiple of your block size, not even for a block of 1024x1x1. You are probably writing outside the global memory space allocated on the GPU.
The next question is, Q: why does it work with 1024 if it’s not a multiple of 4000 or 3000? A: In some cases, storing data outside the allocated space does not crash your application, although it is still an access violation. When you copy the data back to the CPU you copy exactly the bytes you reserved, so you don’t notice that you wrote out of bounds.
As good practice, if your data size is not a multiple of your block size, check the indices before loading / storing data. For example:
__global__ void pad_kernel(float *dst, float *src, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    if (ix < width && iy < height) // guard each dimension: a flat check like
    {                              // idx < width*height would still let threads
                                   // with ix >= width write to the wrong element
        dst[(iy + 2) * (width + 5) + ix + 2] = src[iy * width + ix];
    }
}