I have the following problem:
for i from 1 to 16
    for j from 1 to 4
        execute function on variable X
        write X to some array
end of loops
I also have a couple of GTX 295s. What is the best strategy to parallelize this simple loop? My function is just a couple of shifts, and in the end I need a huge array of the X values it computes, which I then transfer to the host and write to a file.
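
For reference, here is a minimal single-GPU sketch of the loop above in CUDA. It assumes every (i, j) iteration is independent of the others, so each one can get its own thread. The function `shift_fn`, the output file name `x.bin`, and the `unsigned int` element type are placeholders made up for illustration, not the real ones. Splitting the work across several GPUs is not shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NI 16  // outer loop count
#define NJ 4   // inner loop count

// Hypothetical stand-in for the real "couple of shifts" function.
__host__ __device__ unsigned int shift_fn(unsigned int i, unsigned int j)
{
    return (i << 3) ^ j;
}

// One thread per (i, j) pair; assumes the iterations are independent.
__global__ void compute(unsigned int *out)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < NI * NJ) {
        unsigned int i = idx / NJ + 1;  // 1..16
        unsigned int j = idx % NJ + 1;  // 1..4
        out[idx] = shift_fn(i, j);      // "write X to some array"
    }
}

int main()
{
    const int n = NI * NJ;
    unsigned int h_out[NI * NJ];
    unsigned int *d_out = NULL;

    if (cudaMalloc((void **)&d_out, n * sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // A single block of 64 threads covers all 16 * 4 pairs.
    compute<<<1, NI * NJ>>>(d_out);

    // Copy the results back to the host; this also waits for the kernel.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Write the raw array to a file (placeholder name).
    FILE *f = fopen("x.bin", "wb");
    if (f == NULL) {
        fprintf(stderr, "could not open output file\n");
        return 1;
    }
    fwrite(h_out, sizeof(unsigned int), n, f);
    fclose(f);
    return 0;
}
```

With only 16 * 4 = 64 iterations, one block on one GPU is already enough; a larger grid, or a split across devices, would only matter if the real iteration count is much bigger than in the loop as written.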