I wonder whether the maximum number of threads per block really limits the number of threads in each block.

I have found that the CUDA device info reports a maximum number of threads per block. However, when I ran a CUDA file that exceeds it, it seemed to work well!

This is what the device reports about the maximum number of threads per block:

  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024

This is the code of my CUDA file:

dim3 threadPB(1024,1024,64);

As you can see, the number of threads I request per block is 67108864 (1024 x 1024 x 64), which exceeds 1024 (the maximum number of threads per block).

Does it really cause a problem to request more threads per block than the device allows?
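For reference, the per-block thread count is the product of the three block dimensions, and it can be compared with the limits the runtime reports through cudaGetDeviceProperties. The following is only a minimal sketch, assuming device 0 is the GPU in question; the helper name threadsInBlock is made up for illustration. Note that each dimension also has its own limit (maxThreadsDim), separate from the total.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: total threads in one block is x * y * z.
// 64-bit arithmetic keeps large products from overflowing.
static unsigned long long threadsInBlock(dim3 b) {
    return (unsigned long long)b.x * b.y * b.z;
}

int main() {
    cudaDeviceProp prop;
    // Device 0 is an assumption; pick the device you actually run on.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query device 0\n");
        return 1;
    }

    dim3 threadPB(1024, 1024, 64);
    unsigned long long n = threadsInBlock(threadPB);

    printf("requested threads per block: %llu\n", n);
    printf("maxThreadsPerBlock:          %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim:               %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("%s\n", n <= (unsigned long long)prop.maxThreadsPerBlock
                       ? "within the per-block limit"
                       : "exceeds the per-block limit");
    return 0;
}
```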

Do you really allocate 67108864 threads, or do you allocate 64 x 1024 x 1024 threads - a thread block of 1024 threads that is run 1024 x 64 times?

As a simple test, try to launch a kernel with:

dim3 dB(1024*1024*64, 1, 1);

or even

dim3 dB(2048, 1, 1);

You can easily step through the program in the debugger, suspend it, and note which kernel thread blocks are resident, along with their dimensions.
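The test suggested above can also be done without a debugger by asking the runtime whether the launch was accepted. This is a rough sketch, not a definitive recipe: the empty kernel noop and the helper tryLaunch are hypothetical names. On a device whose limit is 1024 threads per block, the 2048-thread launch should be rejected with an invalid-configuration error and the kernel should never run.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel, used only to test whether a launch configuration is accepted.
__global__ void noop() {}

// Hypothetical helper: launch noop with one block of the given shape
// and return the first error the runtime reports.
static cudaError_t tryLaunch(dim3 block) {
    cudaGetLastError();                    // clear any earlier launch error
    noop<<<1, block>>>();
    cudaError_t err = cudaGetLastError();  // launch-configuration errors show up here
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // errors raised while the kernel runs
    return err;
}

int main() {
    dim3 small(1024, 1, 1);
    dim3 tooBig(2048, 1, 1);

    printf("1024 threads per block: %s\n", cudaGetErrorString(tryLaunch(small)));
    printf("2048 threads per block: %s\n", cudaGetErrorString(tryLaunch(tooBig)));
    return 0;
}
```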

This is my source code:

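#include <stdio.h>
#include <sys/time.h>

#define N 100000
__global__ void add(int *a, int *b, int *c){
        // Flattened thread index within one block (blockIdx is not used).
        int tid = threadIdx.x + blockDim.x *threadIdx.y + blockDim.y *blockDim.x * threadIdx.z;
        printf("[%d] x=%d, y=%d, z=%d\n",tid,threadIdx.x,threadIdx.y,threadIdx.z);
        //while(tid< N){
                c[tid] = a[tid]+ b[tid];
        //      tid += blockDim.x * gridDim.x;
        //}
}

int main (){
        struct timeval tv1,tv2,ctv1,ctv2;   // timing code is not shown in the post

        int a[N], b[N],c[N];
        int *dev_a, *dev_b,*dev_c;

        cudaMalloc((void **)&dev_a, N * sizeof(int));
        cudaMalloc((void **)&dev_b, N * sizeof(int));
        cudaMalloc((void **)&dev_c, N * sizeof(int));
        int i;
        for(i=0;i<N;i++){
                a[i] =-i;
                b[i]= i*i;      // note: i*i overflows int once i exceeds 46340
        }

        cudaMemcpy(dev_a, a, N * sizeof(int),cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N * sizeof(int),cudaMemcpyHostToDevice);

        dim3 blockPG(2147483647, 65535, 65535);
        dim3 threadPB(16,16,4);
        int nThreads = threadPB.x * threadPB.y * threadPB.z;

        // The launch line is missing from the post; a single block is assumed here.
        add<<<1,threadPB>>>(dev_a, dev_b, dev_c);

        cudaMemcpy(c, dev_c, N * sizeof(int),cudaMemcpyDeviceToHost);
        //for(i=0;i<N;i++) printf("%d + %d = %d\n",a[i] ,b[i], c[i]);
        // Assumed check range: only the elements written by the launched threads.
        for(i=0;i<nThreads;i++){
                if((a[i]+b[i])!=c[i]) {
                        printf("Wrong in c[%d]\n",i);
                        break;
                }
        }

        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
        return 0;
}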

The CUDA file in the post above works well; it uses 16 x 16 x 4 threads per block.

However, when the number of threads per block exceeds the device’s limit, it does not print the sequence, even though the result is correct.

This is the output when I set 1024 threads:

[727] x=7, y=13, z=2
[728] x=8, y=13, z=2
[729] x=9, y=13, z=2
[730] x=10, y=13, z=2
[731] x=11, y=13, z=2
[732] x=12, y=13, z=2
[733] x=13, y=13, z=2
[734] x=14, y=13, z=2
[735] x=15, y=13, z=2

This is the output when I set 4096 threads:

Wrong in c[0]

It prints neither an error message nor a correct result.

Does the CUDA compiler check whether the threads per block in a CUDA file exceed the device's limit?

Check CUDA return codes for errors, and (for a test) zero the result space before launching the kernel.

CUDA does not clear allocated memory, so you are very likely to find the (correct) results from a previous successful kernel launch. This is a common cause for confusion for new CUDA users.
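Both suggestions can be combined in a few lines. The sketch below is one way to do it, not the only one: the helper addOnDevice and the bounds-checked kernel signature are assumptions that differ from the code posted earlier. It zeroes the result buffer with cudaMemset before the launch, so stale data cannot pass the check, and it returns the first CUDA error instead of ignoring it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x + blockDim.x * threadIdx.y
            + blockDim.x * blockDim.y * threadIdx.z;
    if (tid < n) c[tid] = a[tid] + b[tid];   // guard against out-of-range threads
}

// Hypothetical helper: runs add with one block of shape `block`
// and returns the first CUDA error it encounters.
static cudaError_t addOnDevice(const int *a, const int *b, int *c, int n, dim3 block) {
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;
    size_t bytes = n * sizeof(int);

    cudaError_t err = cudaMalloc(&dev_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dev_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dev_c, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dev_a, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dev_b, b, bytes, cudaMemcpyHostToDevice);
    // Zero the result space so leftovers from an earlier run cannot look correct.
    if (err == cudaSuccess) err = cudaMemset(dev_c, 0, bytes);
    if (err == cudaSuccess) {
        add<<<1, block>>>(dev_a, dev_b, dev_c, n);
        err = cudaGetLastError();            // reports a rejected launch configuration
    }
    // This copy waits for the kernel and also reports errors raised while it ran.
    if (err == cudaSuccess) err = cudaMemcpy(c, dev_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return err;
}

int main() {
    const int n = 1024;
    static int a[n], b[n], c[n];
    for (int i = 0; i < n; i++) { a[i] = -i; b[i] = i * i; }

    cudaError_t err = addOnDevice(a, b, c, n, dim3(16, 16, 4));
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; i++) {
        if (a[i] + b[i] != c[i]) { printf("Wrong in c[%d]\n", i); return 1; }
    }
    printf("all results correct\n");
    return 0;
}
```

Changing the block shape in main to something like dim3(64, 64, 4) should make the program stop at the "CUDA error" line instead of silently comparing against old memory contents.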