Can't understand what grid dimension to use (cudaDeviceSynchronize error code 4)

Technical specifications for compute capability 6.1:
Maximum number of threads per block: 1024
Maximum x-dimension of a grid of thread blocks: 2^31-1

I want to run 2^32 threads, with 32 registers per thread and 128 threads per block (< 1024).

Calculation: 2^32 / 128 = 33554432 blocks (< 2^31-1)
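The arithmetic above can be checked with a small host-side sketch. The names kMaxThreadsPerBlock, kMaxGridDimX, blocksNeeded and withinLimits are hypothetical helpers introduced only for illustration; the two limits are the ones quoted in this post, not values queried from a device.

```cpp
#include <cstdint>

// Limits quoted in the post for compute capability 6.1 (assumed, not queried).
constexpr std::uint64_t kMaxThreadsPerBlock = 1024;
constexpr std::uint64_t kMaxGridDimX = (1ULL << 31) - 1;

// Number of blocks needed to cover totalThreads, rounding up.
constexpr std::uint64_t blocksNeeded(std::uint64_t totalThreads,
                                     std::uint64_t threadsPerBlock) {
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

// True if the configuration stays within the two limits above.
constexpr bool withinLimits(std::uint64_t blocks,
                            std::uint64_t threadsPerBlock) {
    return threadsPerBlock <= kMaxThreadsPerBlock && blocks <= kMaxGridDimX;
}

// Compile-time checks of the numbers in the post.
static_assert(blocksNeeded(1ULL << 32, 128) == 33554432ULL, "2^32 / 128 = 2^25");
static_assert(withinLimits(33554432ULL, 128), "fits the quoted limits");
```

This only confirms that the configuration is within the two quoted limits. It says nothing about whether the kernel will finish without some other error.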

I set up the kernel launch as kernel<<<33554432, 128>>>(), which does not violate these limits.

But cudaDeviceSynchronize returned error code 4 after launching addKernel.

The error does not occur with kernel<<<2097152, 128>>>() in a release build, or with up to about 4194304 blocks in a debug build.

Is there a rule that explains why I cannot use more than 2097152 blocks?

There is no rule like the one you are imagining.

You may be hitting a launch timeout with the larger block configuration, or simply have some other issue that needs to be debugged.
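One way to narrow this down is to check for errors both right after the launch and after synchronization, and to print the error name instead of the bare number. The sketch below assumes an empty placeholder kernel (the real kernel is not shown in this thread) and uses only the standard runtime calls cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorName and cudaGetErrorString.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the real kernel from the question is not shown.
__global__ void kernel() {}

// Prints the error name and message, returns true if err is a failure.
static bool failed(cudaError_t err, const char* what) {
    if (err == cudaSuccess) return false;
    std::fprintf(stderr, "%s: %s (%s)\n", what,
                 cudaGetErrorName(err), cudaGetErrorString(err));
    return true;
}

int main() {
    const unsigned int blocks = 33554432u;  // 2^25, as in the question
    const unsigned int threads = 128u;

    kernel<<<blocks, threads>>>();

    // An invalid launch configuration is reported here.
    if (failed(cudaGetLastError(), "kernel launch")) return 1;

    // Errors raised while the kernel runs, such as a launch timeout,
    // are reported here.
    if (failed(cudaDeviceSynchronize(), "kernel execution")) return 1;

    std::printf("kernel completed\n");
    return 0;
}
```

If the launch itself succeeds and only the synchronization fails, the grid size was accepted and the problem arose during execution, which would be consistent with a timeout or a bug inside the kernel.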