Hi!
I’ve got a question for the C (and CUDA) experts out there. I want to write two versions of the same program: one to run on the CPU and one on the GPU, so that I can compare the two.
For the moment I’m working on the CPU program. Basically the program has to run a loop hundreds of times. In the loop (the outer loop contains two inner loops) I calculate:
[font=“Courier New”]u_new[i][j] = u_old[i][j];[/font] (obviously this type of program should run faster on a GPU)
Now the problem is that during the next iteration I have to calculate new values for u_new based on the old values of u_new. I’ve tried three different versions:
1. At the end of the outermost loop I simply copy u_new to u_old.
2. I use dynamically allocated arrays and swap the pointers (actually the pointers-to-pointers, because it’s a 2D array).
3. I use an if condition: on even iterations of the outer loop the new values go into u_new (calculated from the old values in u_old), and on odd iterations the new values go into u_old (calculated from the old values in u_new).
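To make version 2 concrete, here is a minimal sketch of the pointer-swap approach. The 4-point average inside `step` is only a placeholder for whatever the real update formula is, and the helper names (`alloc2d`, `step`, `run`) are mine, not from the original program:

```c
#include <stdlib.h>

/* Allocate an n x n grid as an array of row pointers, zero-initialised. */
double **alloc2d(int n)
{
    double **a = malloc(n * sizeof *a);
    for (int i = 0; i < n; i++)
        a[i] = calloc(n, sizeof **a);
    return a;
}

/* One update step: dst gets new values computed from src.
   The 4-point average is a stand-in for the real formula. */
void step(double **dst, double **src, int n)
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            dst[i][j] = 0.25 * (src[i-1][j] + src[i+1][j]
                              + src[i][j-1] + src[i][j+1]);
}

/* Run iters steps; on return, *pu_old points at the latest values.
   Swapping the pointers-to-pointers replaces the O(n^2) copy of
   version 1 with an O(1) exchange. */
void run(double ***pu_old, double ***pu_new, int n, int iters)
{
    for (int it = 0; it < iters; it++) {
        step(*pu_new, *pu_old, n);
        double **tmp = *pu_old;
        *pu_old = *pu_new;
        *pu_new = tmp;
    }
}
```

Version 3 (alternating buffers on even/odd iterations) computes exactly the same thing; the swap just expresses it without the branch.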
It seems that the third version is the fastest, but the differences are small (100 ms out of 7 seconds).
Does anyone know better solutions for this problem? And which of the versions do you think would work best on a GPU?
On the GPU there is a very big difference between the two array layouts. A pointer-to-pointer array requires two pointer reads to fetch an entry, and there is no guarantee that memory-coalescing requirements can be met when a warp performs a load or store on the array. A flat array requires only one pointer read and an integer multiply-add. When global memory latency is on the order of thousands of cycles, that is a big difference between the two approaches.
The pointer-exchange idea will be best on the GPU. Keep the outer loop on the host, turn the inner loop(s) into a kernel, and swap the pointers with one kernel launch per outer-loop iteration. Kernel launch time is only a few microseconds on most platforms, so it shouldn’t have a large effect on overall performance for a non-trivial kernel.
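A rough sketch of that structure, assuming the grid is stored as a flat 1D array on the device (the 4-point average is again just a placeholder, and the kernel/function names are illustrative):

```cuda
__global__ void step_kernel(float *u_new, const float *u_old, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1 && j > 0 && j < n - 1)
        /* flat indexing: one pointer read plus an integer multiply-add */
        u_new[i * n + j] = 0.25f * (u_old[(i - 1) * n + j] + u_old[(i + 1) * n + j]
                                  + u_old[i * n + j - 1] + u_old[i * n + j + 1]);
}

/* Host side: the outer loop stays on the CPU, one launch per iteration.
   d_old and d_new are device buffers allocated with cudaMalloc. */
void run_on_gpu(float *d_old, float *d_new, int n, int iters)
{
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    for (int it = 0; it < iters; it++) {
        step_kernel<<<grid, block>>>(d_new, d_old, n);
        float *tmp = d_old;   /* swap the device pointers on the host; */
        d_old = d_new;        /* no data moves, and launches on the    */
        d_new = tmp;          /* same stream run in order anyway       */
    }
    cudaDeviceSynchronize();
}
```

Because consecutive `j` values map to consecutive threads in a warp and to consecutive addresses, the loads and stores on the interior of each row can coalesce.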
There’s a difference (albeit less crippling) on the CPU too. The caches help, but [font=“Courier New”]a[i][j][/font] syntax is still going to waste cache lines on pointer data rather than something useful. Nonscientific testing with the code I’m currently working on suggests that there’s a factor of two in performance on Nehalem from fixing this. Of course, if I were to do so, it would make my GPU acceleration efforts (which I’m actually paid for) less impressive…
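The fix being alluded to is the same flat layout on the CPU: one contiguous allocation indexed with [font=“Courier New”]i*n + j[/font], so no row-pointer table competes with the data for cache lines. A minimal sketch (the macro and function names are mine, and the 4-point average is again a placeholder):

```c
#include <stdlib.h>

/* Flat 1D storage for an n x n grid: a single contiguous allocation,
   indexed as i*n + j instead of chasing a row pointer first. */
#define IDX(i, j, n) ((i) * (n) + (j))

double *alloc_grid(int n)
{
    return calloc((size_t)n * n, sizeof(double));
}

void step_flat(double *dst, const double *src, int n)
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            dst[IDX(i, j, n)] =
                0.25 * (src[IDX(i - 1, j, n)] + src[IDX(i + 1, j, n)]
                      + src[IDX(i, j - 1, n)] + src[IDX(i, j + 1, n)]);
}
```

The pointer swap from before still works, only now it is a plain [font=“Courier New”]double *[/font] exchange, and the same layout carries over unchanged to the GPU version.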