Well, you can’t parallelize the inner loop, because each step needs the value computed by the previous step. But depending on the size of your data (the value of MAX in this case), you may be able to parallelize the outer loop, since there is no such restriction there. Basically, you assign each thread one value of ‘i’ and have it run the inner loop over the ‘j’ values. If the data is really large, you may want to cache some of each thread’s values in shared memory so you’re not constantly reading from and writing to global memory.
You can only parallelize the outer loop, because each iteration of the inner loop depends on the result of the previous iteration (except for the first one, of course), so the inner loop has to run serially. However, the calculations inside the inner loop don’t depend on any other iteration of the outer loop, so you can launch a bunch of threads where the thread index takes the place of the outer loop’s index (‘i’) and each thread runs the inner loop serially.
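Here is a rough sketch of what that could look like. The original loop body isn't shown, so I'm assuming a simple running sum along each row (element j = element j-1 + its own value) on an n x n row-major array. The names `rowScanKernel` and `runRowScan` are made up for illustration, and the block size of 128 is just a guess you would want to tune.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed recurrence (not from the original code):
// data[i][j] = data[i][j-1] + data[i][j], a running sum along row i.
__global__ void rowScanKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // thread index replaces outer loop index 'i'
    if (i >= n) return;                             // guard threads past the last row

    float prev = data[i * n];                       // running value is kept in a register
    for (int j = 1; j < n; ++j) {                   // inner loop stays serial within the thread
        prev += data[i * n + j];
        data[i * n + j] = prev;
    }
}

// Hypothetical host helper: copies the matrix to the device, launches one
// thread per row, and copies the result back. Returns false on any CUDA error.
bool runRowScan(std::vector<float> &host, int n)
{
    float *dev = nullptr;
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 128;                          // assumed block size
        int blocks = (n + threads - 1) / threads;   // enough blocks to cover all rows
        rowScanKernel<<<blocks, threads>>>(dev, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

You would call it with something like `runRowScan(matrix, MAX)`. One thing to watch: with a row-major layout, neighboring threads touch addresses n floats apart, so global memory accesses won't coalesce well. Storing the data transposed (so that thread i and thread i+1 read adjacent elements at each step j) is one possible fix, and that is also where staging tiles in shared memory could help.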