Best practice for grids writing to arrays

This is probably a rather basic question, but I can't seem to find any specific answers, so I figured I would ask. If I have input and output arrays of size X, and for each element of the input array I need to run a nested loop of size 50 (so 50^2 operations per element), what would be the best way of distributing the work and writing the results back to the output array?

So right now I have something like this (heavily pared down):

attributes(device) subroutine theNestedLoop(input, outputArray)
  integer :: outputArray(*)
  integer :: input, outputOfLoop
  outputOfLoop = ...   ! the result of the nested loop using input
  outputArray(blockIdx%x) = outputOfLoop   ! one block per element
end subroutine theNestedLoop

attributes(global) subroutine launchLoop(inputArray, outputArray)
  integer :: inputArray(*), outputArray(*)
  call theNestedLoop(inputArray(blockIdx%x), outputArray)
end subroutine launchLoop

! host side: X blocks of 1 thread each
call launchLoop<<<X, 1>>>(inputArray, outputArray)

If I understand how CUDA works, this launches X blocks of 1 thread each, and ideally they would all run their 50^2 loop at the same time (please correct me if I'm wrong), but the loop itself does not run in parallel.
I would like to run the 50^2 loop in parallel as well, so I was considering using grids, but I don't know what the best practice would be for copying outputOfLoop into outputArray. Could someone help me?
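
To make the second part concrete, here is a rough sketch of what I imagine the parallel version might look like in CUDA Fortran. It assumes several things the pared-down code above does not say: that the loop result is a sum over the 50 x 50 iterations (so it can be reduced), that integer data is fine, and that a block size of 64 is reasonable. The names loopBody, partial, NTHREADS and nested_loop_sketch are placeholders I made up. Each block still handles one input element, the threads in the block split the 2500 iterations between them, and only thread 1 writes to outputArray(blockIdx%x).

```fortran
module nested_loop_sketch
  use cudafor
  implicit none
  integer, parameter :: N = 50          ! loop extent from the question
  integer, parameter :: NTHREADS = 64   ! assumed block size (power of two)
contains

  ! Hypothetical stand-in for one (i, j) iteration of the real loop body.
  attributes(device) function loopBody(input, i, j) result(val)
    integer, value :: input, i, j
    integer :: val
    val = input + i + j
  end function loopBody

  attributes(global) subroutine launchLoop(inputArray, outputArray)
    integer :: inputArray(*), outputArray(*)
    integer, shared :: partial(NTHREADS)   ! one partial sum per thread
    integer :: tid, k, i, j, stride, acc

    tid = threadIdx%x
    acc = 0
    ! Each thread takes a strided share of the N*N flattened iterations.
    do k = tid, N*N, blockDim%x
      i = (k - 1) / N + 1
      j = mod(k - 1, N) + 1
      acc = acc + loopBody(inputArray(blockIdx%x), i, j)
    end do
    partial(tid) = acc
    call syncthreads()

    ! Tree reduction of the partial sums in shared memory.
    stride = blockDim%x / 2
    do while (stride >= 1)
      if (tid <= stride) partial(tid) = partial(tid) + partial(tid + stride)
      call syncthreads()
      stride = stride / 2
    end do

    ! A single thread per block writes, so no two threads share an element.
    if (tid == 1) outputArray(blockIdx%x) = partial(1)
  end subroutine launchLoop
end module nested_loop_sketch

program main
  use cudafor
  use nested_loop_sketch
  implicit none
  integer, parameter :: X = 4            ! small size just for illustration
  integer :: hostIn(X), hostOut(X)
  integer, device :: devIn(X), devOut(X)

  hostIn = [1, 2, 3, 4]
  devIn = hostIn                         ! host-to-device copy
  call launchLoop<<<X, NTHREADS>>>(devIn, devOut)
  hostOut = devOut                       ! device-to-host copy
  print *, hostOut
end program main
```

The launch must use NTHREADS threads per block, since the shared array is sized to match. If the loop result is not a simple sum, the reduction step would need to change, and I am not sure this is the recommended pattern, which is really what I am asking.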

Please forgive the basic question, and please let me know if this question is unclear.