Best Parallelization Strategy: help me get my first algorithm parallelized

I have the following problem:

for i from 1 to 16
    for j from 1 to 4
        execute function on variable X
        write X to some array
end of loops

I have a couple of GTX 295s. What is the best strategy to parallelize this simple loop? My function is just a couple of shifts, and in the end I need a huge array of X values computed this way, which I then transfer to the host and write to a file.

Map <blockIdx, threadIdx> to <x, i, j>, so that each thread handles exactly one (x, i, j) combination, and let the kernel be a function of <x, i, j>.

Very simple, isn't it?
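
For what it's worth, here is a rough sketch of how that mapping might look in CUDA. It is only one possible reading of the suggestion: shift_fn is a made-up placeholder for the "couple of shifts", and compute_all, flat_index, NI and NJ are hypothetical names that do not come from the original post. It assumes one block per X value (with a stride loop so the grid never exceeds 65535 blocks), 64 threads per block, one thread per (i, j) pair, 0-based loop indices, and a single GPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Loop bounds from the question, 0-based here: i in [0,16), j in [0,4).
enum { NI = 16, NJ = 4 };

// Hypothetical placeholder for "a couple of shifts"; swap in the real function.
__host__ __device__ inline unsigned shift_fn(unsigned x, unsigned i, unsigned j) {
    return (x << i) ^ (x >> j);
}

// Flat position in the output array for the triple <x, i, j>.
__host__ __device__ inline size_t flat_index(size_t x, unsigned i, unsigned j) {
    return (x * NI + i) * NJ + j;
}

// blockIdx.x selects x; threadIdx.x is j and threadIdx.y is i, so that
// consecutive threads of a block write consecutive output elements.
__global__ void shift_kernel(const unsigned* in, unsigned* out, size_t num_x) {
    unsigned j = threadIdx.x;  // 0..3
    unsigned i = threadIdx.y;  // 0..15
    // Stride loop: each block handles x, x + gridDim.x, x + 2*gridDim.x, ...
    for (size_t x = blockIdx.x; x < num_x; x += gridDim.x) {
        out[flat_index(x, i, j)] = shift_fn(in[x], i, j);
    }
}

// Host wrapper: copy inputs to the device, launch, copy all results back.
std::vector<unsigned> compute_all(const std::vector<unsigned>& xs) {
    size_t num_x = xs.size();
    std::vector<unsigned> result(num_x * NI * NJ);
    if (num_x == 0) return result;

    unsigned* d_in = nullptr;
    unsigned* d_out = nullptr;
    cudaMalloc(&d_in, num_x * sizeof(unsigned));
    cudaMalloc(&d_out, result.size() * sizeof(unsigned));
    cudaMemcpy(d_in, xs.data(), num_x * sizeof(unsigned), cudaMemcpyHostToDevice);

    dim3 block(NJ, NI);  // 4 x 16 = 64 threads per block
    unsigned grid = num_x < 65535 ? (unsigned)num_x : 65535u;
    shift_kernel<<<grid, block>>>(d_in, d_out, num_x);

    // This copy waits for the kernel to finish and reports any earlier error.
    cudaError_t err = cudaMemcpy(result.data(), d_out,
                                 result.size() * sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling compute_all(xs) from the host gives back one vector holding 64 results per input value, which can then be written to a file in a single pass. Since the work per element is tiny, the host-device copies will probably dominate the run time, so it seems worth doing one big transfer at the end rather than many small ones.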