I am a beginner with CUDA, and I have a question about block size that has confused me for some time. I have the following code:
//////////////////////////////////////////////////////////
#define BLOCK_SIZE 16
unsigned int n = 512;
uint3 sf_dim_b = make_uint3(BLOCK_SIZE, BLOCK_SIZE,1);
uint3 sf_dim_g = make_uint3(ceil(n/(float)sf_dim_b.x), ceil(n/(float)sf_dim_b.y) ,1);
uint3 sf_dim_g_2 = make_uint3(ceil((n+2)/(float)sf_dim_b.x), ceil((n+2)/(float)sf_dim_b.y), 1);

__global__ void my_function_1(…)
{
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    …
}

__global__ void my_function_2(…)
{
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    …
}
/////////////////////////////////////////////////////////
BTW, my video card is an 8800GT.

When I set the block size to 1 or 2, it works very well. However, if I set it to 16 or some other value, I get an "unspecified launch failure". This is probably because (n+2) % BLOCK_SIZE != 0. But a block size of 1 or 2 gives poor performance, so I would like to know how to get correct results with larger blocks while achieving the best performance.

When the block size does not evenly divide the size of the problem, the total number of threads will be greater than the problem size. You will usually have to tell the extra threads at the end to quit early, otherwise they will attempt to read or write beyond the end of the array.

Like so:

__global__ void my_function_1(..............)
{
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= n || x >= n) {
        return;
    }
    .......
}

One other comment: I generally prefer to use integer arithmetic, (n + BLOCK_SIZE - 1) / BLOCK_SIZE, instead of ceil() and floats, because I get paranoid about roundoff. I know 625.0/25.0 should be 25.0, but it's hard for me to be sure that it won't be 25.0000001 and give a wrong ceil() value.