I have a project with an array A of length N and an array B.
I want to divide array A into segments, with each segment assigned to a specific thread block.

I want to launch three blocks: the first block calculates the sum of elements 0-3, the second block the sum of elements 4-6, and the third block the sum of elements 7-11.

I am not asking for code; I am asking for an algorithm.

for ( int i = 0; i < N/3; ++i ) {
  int t = i*3;
  // calculate using A[t..t+2] and B[0..2]
}

Launch 3 blocks. In each block, do a block-level reduction on that block's segment of the data. Use the blockIdx.x built-in variable in your CUDA kernel to index into an array that holds the segment boundaries, so each block knows where its segment starts and ends.
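As a rough illustration of that approach, here is a minimal sketch, not a tuned implementation. It assumes the elements are `int`, and the names `segmentSum`, `segmentSums`, `offsets`, and `kThreads` are made up for this example. `offsets` holds one more entry than there are segments, so block `b` sums `A[offsets[b]]` through `A[offsets[b+1] - 1]`. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Threads per block (an assumption); must be a power of two for the tree reduction below.
constexpr int kThreads = 128;

// One block per segment: block b sums A[offsets[b]] .. A[offsets[b + 1] - 1].
__global__ void segmentSum(const int* A, const int* offsets, int* out) {
    __shared__ int partial[kThreads];
    const int begin = offsets[blockIdx.x];
    const int end   = offsets[blockIdx.x + 1];

    // Each thread accumulates a strided share of the segment.
    int local = 0;
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x) {
        local += A[i];
    }
    partial[threadIdx.x] = local;
    __syncthreads();

    // Shared-memory tree reduction across the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        out[blockIdx.x] = partial[0];
    }
}

// Hypothetical host wrapper: copies the data over, launches one block per segment,
// and returns the per-segment sums.
std::vector<int> segmentSums(const std::vector<int>& A, const std::vector<int>& offsets) {
    const int numSegments = static_cast<int>(offsets.size()) - 1;
    int *dA = nullptr, *dOffsets = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(int));
    cudaMalloc(&dOffsets, offsets.size() * sizeof(int));
    cudaMalloc(&dOut, numSegments * sizeof(int));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dOffsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);

    segmentSum<<<numSegments, kThreads>>>(dA, dOffsets, dOut);

    std::vector<int> out(numSegments);
    cudaMemcpy(out.data(), dOut, numSegments * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOffsets);
    cudaFree(dOut);
    return out;
}
```

For the segments in the question (0-3, 4-6, 7-11), the boundary array would be `{0, 4, 7, 12}`, which launches three blocks. Storing boundaries rather than a fixed segment length is what lets the segments have different sizes.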

You can write your own block-level reduction, but the CUDA reduction sample code is a good thing to review.

CUB also provides block-level reduction as a library operation.
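To give an idea of what the CUB route might look like, here is a minimal sketch using `cub::BlockReduce`. It assumes `int` elements and a boundary array `offsets` with one more entry than there are segments, so block `b` covers `A[offsets[b]]` through `A[offsets[b+1] - 1]`. The names `segmentSumCub`, `segmentSumsCub`, `offsets`, and `kBlockThreads` are invented for this example, and error checking is left out.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Threads per block (an assumption); must match the launch configuration.
constexpr int kBlockThreads = 128;

// One block per segment; CUB performs the block-wide reduction.
__global__ void segmentSumCub(const int* A, const int* offsets, int* out) {
    using BlockReduce = cub::BlockReduce<int, kBlockThreads>;
    __shared__ typename BlockReduce::TempStorage tempStorage;

    const int begin = offsets[blockIdx.x];
    const int end   = offsets[blockIdx.x + 1];

    // Each thread accumulates a strided share of the segment.
    int local = 0;
    for (int i = begin + threadIdx.x; i < end; i += kBlockThreads) {
        local += A[i];
    }

    // The aggregate returned by Sum() is only valid in thread 0.
    const int total = BlockReduce(tempStorage).Sum(local);
    if (threadIdx.x == 0) {
        out[blockIdx.x] = total;
    }
}

// Hypothetical host wrapper: launches one block per segment and returns the sums.
std::vector<int> segmentSumsCub(const std::vector<int>& A, const std::vector<int>& offsets) {
    const int numSegments = static_cast<int>(offsets.size()) - 1;
    int *dA = nullptr, *dOffsets = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(int));
    cudaMalloc(&dOffsets, offsets.size() * sizeof(int));
    cudaMalloc(&dOut, numSegments * sizeof(int));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dOffsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);

    segmentSumCub<<<numSegments, kBlockThreads>>>(dA, dOffsets, dOut);

    std::vector<int> out(numSegments);
    cudaMemcpy(out.data(), dOut, numSegments * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOffsets);
    cudaFree(dOut);
    return out;
}
```

The block size is a template parameter of `cub::BlockReduce`, so the kernel has to be launched with exactly `kBlockThreads` threads per block. With segments this small most threads contribute zero, which is fine for correctness but means a smaller block size would probably do just as well.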