Comparison CPU - GPU

I’ve got a question for the C (and CUDA) experts out there. I want to write two versions of the same program: one to run on the CPU and one on the GPU, so that I can compare the two.
For the moment I’m working on the CPU program. Basically, the program runs an outer loop hundreds of times; inside it, two nested inner loops calculate:
u_new[i][j]=u_old[i][j]; (obviously this type of program should run faster on a GPU)

Now the problem is that during the next iteration I have to calculate the new values of u_new from its old values. I’ve tried three different versions:

  1. At the end of the outermost loop I simply copy u_new to u_old.
  2. I use dynamically allocated arrays and swap the pointers (actually the pointers-to-pointers, because it’s a 2D array).
  3. I use an if condition on the iteration count: on even iterations u_new receives the new values calculated from u_old, and on odd iterations u_old receives the new values calculated from u_new.

It seems that the third version is the fastest, but the differences are small (about 100 ms out of 7 seconds).

Does anyone know better solutions for this problem? And which of the versions do you think would work best on a GPU?

Thanks a lot!

Why do you use pointers to pointers instead of just allocating one memory block?

Because I want to address the values as u[i][j] rather than u[i*n+j]… but I guess there’s no big difference.

On the GPU there is a very big difference between the two. The first requires two pointer reads to fetch an entry, and there is no guarantee that memory-coalescing requirements can be met when a warp performs a load or store on the array. The second requires only one pointer read and an integer multiply-add. When global memory latency is on the order of thousands of cycles, that is a big difference between the two approaches.

The pointer-exchange idea will be best on the GPU. Keep the outer loop on the host, turn the inner loop(s) into a kernel, and swap the pointers with one kernel launch per outer-loop iteration. Kernel launch overhead is only a few microseconds on most platforms, so it shouldn’t have a large effect on overall performance for a non-trivial kernel.


There’s a difference (albeit a less crippling one) on the CPU too. The caches help, but the [font=“Courier New”]a[i][j][/font] syntax still wastes cache lines on pointer data rather than on something useful. Nonscientific testing with the code I’m currently working on suggests there’s a factor of two in performance on Nehalem from fixing this. Of course, if I were to do so, it would make my GPU acceleration efforts (which I’m actually paid for) less impressive…