Comparison CPU - GPU

Hi!
I’ve got a question for the C (and CUDA) experts out there. I want to write two versions of the same program: one to run on the CPU and one on the GPU, so that I can compare the two.
For the moment I’m working on the CPU program. Basically, the program has to run a loop hundreds of times. Inside the loop (the outer loop contains two inner loops over i and j) I calculate:
u_new[i][j] = u_old[i][j]; (obviously this type of program should run faster on a GPU)

Now the problem is that during the next iteration I have to calculate new values for u_new based on the old values of u_new. I’ve tried three different versions:

  1. At the end of the outermost loop I simply copy u_new to u_old.
  2. I use dynamically allocated arrays and interchange the pointers (actually the pointers to the pointers, because it’s a 2D array); see the sketch after this list.
  3. I use an if condition: on even iterations the array u_new holds the new values (calculated from the old ones in u_old), and on odd iterations the array u_old holds the new values (calculated from the old ones in u_new).
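
For reference, here’s a minimal sketch of what I mean by version 2 (pointer swapping); the grid size, the iteration count, and the trivial copy update are placeholders for my real code:

#include <stdlib.h>

int main(void)
{
    const int n = 1024;          /* grid size (placeholder) */
    const int iterations = 500;  /* outer loop count (placeholder) */

    /* Two n x n grids, each stored as an array of row pointers. */
    double **u_old = malloc(n * sizeof *u_old);
    double **u_new = malloc(n * sizeof *u_new);
    for (int i = 0; i < n; i++) {
        u_old[i] = calloc(n, sizeof **u_old);
        u_new[i] = calloc(n, sizeof **u_new);
    }

    for (int iter = 0; iter < iterations; iter++) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                u_new[i][j] = u_old[i][j];   /* real update goes here */

        /* Swap the pointers-to-pointers instead of copying the data. */
        double **tmp = u_old;
        u_old = u_new;
        u_new = tmp;
    }

    for (int i = 0; i < n; i++) { free(u_old[i]); free(u_new[i]); }
    free(u_old);
    free(u_new);
    return 0;
}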

It seems that the third version is the fastest, but the differences are small (about 100 ms out of 7 seconds total).

Does anyone know better solutions for this problem? And which of the versions do you think would work best on a GPU?

Thanks a lot!

Why do you use pointers to pointers instead of just allocating one memory block?

Because I want to address the values as u[i][j], rather than u[i*n+j]…but I guess there’s no big difference

On the GPU there is a very big difference between the two. The first requires two dependent pointer reads to fetch an entry, and there is no guarantee that memory coalescing requirements can be met when a warp performs a load or store on the array. The second requires only one pointer read plus an integer multiply-add. When global memory latency is on the order of thousands of cycles, that is a big difference between the two approaches.
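
To make that concrete, here is a sketch of a kernel using the single-block layout (the kernel name and thread geometry are illustrative, not from your code). With row-major indexing, threads with consecutive j within a warp touch consecutive addresses, so loads and stores can coalesce:

__global__ void step(double *u_new, const double *u_old, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column */
    int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row    */

    if (i < n && j < n)
        /* One integer multiply-add computes the address; no second
           pointer read is needed. */
        u_new[i * n + j] = u_old[i * n + j];
}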

The pointer exchange idea will work best on the GPU. Keep the outer loop on the host, turn the inner loop(s) into a kernel, and swap pointers with one kernel launch per outer loop iteration. Kernel launch overhead is only a few microseconds on most platforms, so it shouldn’t have a large effect on overall performance for a non-trivial kernel.
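
A sketch of the host side under the same assumptions (reusing the step kernel from the previous post; sizes are placeholders and error checking is omitted):

const int n = 1024;          /* grid size (placeholder) */
const int iterations = 500;  /* outer loop count (placeholder) */

double *d_old, *d_new;
cudaMalloc(&d_old, n * n * sizeof(double));
cudaMalloc(&d_new, n * n * sizeof(double));
/* ... copy the initial grid into d_old with cudaMemcpy ... */

dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

for (int iter = 0; iter < iterations; iter++) {
    step<<<grid, block>>>(d_new, d_old, n);

    /* Swap the device pointers on the host: no data moves. */
    double *tmp = d_old;
    d_old = d_new;
    d_new = tmp;
}
cudaDeviceSynchronize();
/* After the loop, d_old points at the most recent result. */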

mutters

There’s a difference (albeit less crippling) on the CPU too. The caches help, but a[i][j] syntax backed by pointer arrays still wastes cache lines on pointer data rather than on something useful. Nonscientific testing with the code I’m currently working on suggests a factor of two in performance on Nehalem from fixing this. Of course, if I were to do so, it would make my GPU acceleration efforts (which I’m actually paid for) less impressive…
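
If anyone wants to try the fix, a minimal sketch of the flat layout on the CPU; the U macro is just illustrative sugar to keep two-index syntax over a single block:

#include <stdlib.h>

/* One contiguous block; the macro keeps u[i][j]-style syntax. */
#define U(i, j) u[(size_t)(i) * n + (j)]

int main(void)
{
    const size_t n = 1024;
    double *u = calloc(n * n, sizeof *u);

    U(2, 3) = 1.0;   /* one address computation, no pointer chasing */

    free(u);
    return 0;
}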