CUDA C++ Programming

I want to port the following CPU code to CUDA, but I don’t know how to write it. I would appreciate some advice.

s = 100;
t = 300;

for(j = 0; j < s; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}

for(j = s; j < (s + 100); j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}

for(j = (s + 100); j < t; j++){
	for(i = 0; i < s; i++){
		a[i+j] = b[0];		
	}
	for(i = s; i < (s + 100); i++){
		a[i+j] = b[i];
	}
	for(i = (s + 100); i < t; i++){
		a[i+j] = b[100];
	}
}

For each element in a, create a thread; in each thread, determine which b to load and write it in. Do not use any loops.

I first learned CUDA from this:

“For each element in a create a thread, in each thread determine”
→ I wrote this as follows:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int jdx = blockIdx.y * blockDim.y + threadIdx.y;

“which b to load and write it in.”
→ I can’t picture how to do this.
How would you write the source code, specifically?

You only need one dimension for the elements of a, so just use:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
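
As a rough sketch of what that single index implies for the launch: the grid has to supply at least as many threads as a has elements, and any surplus threads in the last block must do nothing. The names blocks_for and one_thread_per_element, the block size of 256, and n = 1000 are illustrative assumptions, not taken from the posts above. The helper compiles as plain C++; the guarded part needs nvcc, and error checking is omitted for brevity.

```cpp
// Number of blocks needed so that blocks * threads_per_block >= n.
inline int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;  // ceiling division
}

#ifdef __CUDACC__
// Each thread handles exactly one element. Here it only records its own
// index, to show that every element 0..n-1 is covered exactly once.
__global__ void one_thread_per_element(int* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {           // the last block may contain surplus threads
        out[idx] = idx;
    }
}

int main() {
    const int n = 1000;      // illustrative element count (assumption)
    const int threads = 256; // illustrative block size (assumption)
    int* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    one_thread_per_element<<<blocks_for(n, threads), threads>>>(d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
#endif
```

The same pattern carries over to the real problem: only the body of the if changes, from recording idx to loading the right element of b.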

Find out how many elements a has (that is, what the maximum index is).
Then find out which intervals of indices run which code. Some array elements are written to more than once, so find out which write happens last.
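
Working through those two steps gives one possible answer, sketched below under stated assumptions. The largest index is (t-1)+(t-1) = 598, so a has 599 elements. The three j-loops have identical bodies and together cover j = 0..299 in ascending order, so the last write to a[k] comes from the largest valid j, which is min(k, 299), with i = k - j. That yields three intervals: k = 0..398 takes b[0], k = 399..498 takes b[k-299], and k = 499..598 takes b[100]. The names source_index, reference_fill and fill_a, the float element type, and a b of at least 200 elements are assumptions made for illustration. The helper and the CPU reference compile as plain C++; the guarded part needs nvcc and omits error checking.

```cpp
#include <cstdio>
#include <vector>

#ifndef __CUDACC__   // let the index helper compile as plain C++ too
#define __host__
#define __device__
#endif

constexpr int S = 100;        // s in the original code
constexpr int T = 300;        // t in the original code
constexpr int N = 2 * T - 1;  // a[0..598]

// For output element k, return the index into b whose value survives.
__host__ __device__ inline int source_index(int k) {
    int j = (k < T - 1) ? k : (T - 1);  // largest j that can reach a[k]
    int i = k - j;                      // inner index of that final write
    if (i < S)       return 0;          // k = 0..398
    if (i < S + 100) return i;          // k = 399..498
    return 100;                         // k = 499..598
}

// CPU reference: the original loops, with the three identical j-loops
// collapsed into a single loop over j = 0..t-1.
inline std::vector<float> reference_fill(const std::vector<float>& b) {
    std::vector<float> a(N, 0.0f);
    for (int j = 0; j < T; j++) {
        for (int i = 0; i < S; i++)       a[i + j] = b[0];
        for (int i = S; i < S + 100; i++) a[i + j] = b[i];
        for (int i = S + 100; i < T; i++) a[i + j] = b[100];
    }
    return a;
}

#ifdef __CUDACC__
// One thread per element of a, no loops.
__global__ void fill_a(float* a, const float* b, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = b[source_index(idx)];
}

int main() {
    std::vector<float> h_b(200);        // assumed size: b[0..199] is read
    for (int i = 0; i < 200; i++) h_b[i] = float(i) + 0.5f;
    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, h_b.size() * sizeof(float));
    cudaMemcpy(d_b, h_b.data(), h_b.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    int threads = 256, blocks = (N + threads - 1) / threads;
    fill_a<<<blocks, threads>>>(d_a, d_b, N);
    std::vector<float> h_a(N);
    cudaMemcpy(h_a.data(), d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = (h_a == reference_fill(h_b));  // compare GPU result with CPU loops
    std::printf("%s\n", ok ? "match" : "mismatch");
    cudaFree(d_a);
    cudaFree(d_b);
    return ok ? 0 : 1;
}
#endif
```

Keeping the interval logic in a small function that also builds on the host makes it easy to compare against the original loops before running anything on the GPU.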