Configuring the CUDA Kernel

I can’t figure this out. How should I configure the CUDA kernel when I use pitched memory and cudaMemcpy2D for 4x1M and 4x10M arrays? My code processes a different number of rows every time.
I can’t post the whole code right now, but here is part of it:

int x = blockDim.x * blockIdx.x + threadIdx.x;
int y = blockDim.y * blockIdx.y + threadIdx.y;
B[y * pitch + j] += A[y * pitch + i] - C[j * pitch + i];

Array B is initially filled with zeros, but the result is different every time. Depending on the launch configuration, it either fills only 1-1000 elements correctly or fills them in chunks (every 250 elements, and so on).
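For comparison, the usual addressing rule for pitched memory can be sketched on the host, without a GPU. This is only an illustration under assumptions, not the code from this post: the arrays are taken to be 4 rows of N floats, the pitch is taken to be in bytes (which is what cudaMallocPitch returns), and the 512-byte row alignment and the helper names (padded_pitch, row_ptr, blocks_for, emulate_kernel) are hypothetical. It shows two things that commonly produce partial or chunked results: indexing with the byte pitch as if it were an element count, and launching too few blocks or omitting the bounds check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round a row width in bytes up to a multiple of `alignment`. This imitates
// the padding a pitched allocation adds; 512 bytes is an assumed value, the
// real pitch must always be taken from the allocation call.
std::size_t padded_pitch(std::size_t width_bytes, std::size_t alignment = 512) {
    return (width_bytes + alignment - 1) / alignment * alignment;
}

// The pitch is in BYTES, so step through rows on a char pointer first and
// only then treat the result as a float pointer. Writing base[row * pitch + x]
// on a float* would skip four times too far for 4-byte floats.
float* row_ptr(float* base, std::size_t pitch_bytes, std::size_t row) {
    return reinterpret_cast<float*>(
        reinterpret_cast<char*>(base) + row * pitch_bytes);
}

// Ceiling division: the number of blocks needed to cover n elements.
std::size_t blocks_for(std::size_t n, std::size_t block_size) {
    return (n + block_size - 1) / block_size;
}

// Host emulation of a 2D launch with blockDim = (block_x, 1) and
// gridDim = (blocks_for(width, block_x), height). Each simulated thread
// adds 1 to exactly one element of the width x height array.
void emulate_kernel(float* base, std::size_t pitch_bytes,
                    std::size_t width, std::size_t height,
                    std::size_t block_x) {
    assert(pitch_bytes % sizeof(float) == 0);
    const std::size_t grid_x = blocks_for(width, block_x);
    for (std::size_t y = 0; y < height; ++y) {              // plays blockIdx.y
        for (std::size_t bx = 0; bx < grid_x; ++bx) {       // plays blockIdx.x
            for (std::size_t tx = 0; tx < block_x; ++tx) {  // plays threadIdx.x
                const std::size_t x = bx * block_x + tx;
                if (x >= width) continue;  // last block overshoots the row
                row_ptr(base, pitch_bytes, y)[x] += 1.0f;
            }
        }
    }
}
```

Calling emulate_kernel on a zeroed buffer of pitch / sizeof(float) * height floats should leave every element inside the width x height region at 1 and the padding at 0. In a real kernel the same pattern would apply with x and y computed from blockIdx, blockDim and threadIdx, and with the pitch passed in as returned by the allocation.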

P.S. Sorry, I’m new to all of this, including the forum. If this question has already been asked, please go easy on me; I may have missed it.

GPU: RTX 3080 Ti