Porting the code I'm working on right now to OpenACC was a bit tricky, but I'm now confident that it runs correctly with OpenACC on our P100 card. The question now is how to get better performance.
The code in question is at: https://filenurse.com/download/5fe52487fabb829eef7b8f75d2f85403.html
To compile both the regular and the OpenACC version I do:
make clean ; make OACC=0 ; make clean ; make OACC=1
And to run the code in both versions:
./rii_acc.x > rii.out.acc ; ./rii_cpu.x > rii.out.cpu
The maximum relative error on my machine when run like that is ~1e-9, so the results look good (as reported by numdiff: numdiff rii.out.acc rii.out.cpu -q -S).
But the performance is not very good at the moment. There are two 'heavy' routines, riiparts and rii_matrix. The code runs a loop 100 times calling both routines, and on my machine each iteration (one execution of both routines) takes around 42 ms.
In the OpenACC version these two routines end up as three kernels: riiparts_173_gpu, rii_matrix_252_gpu and rii_matrix_294_gpu, and judging by the kernel start times, one execution of the three kernels takes around 2.78 ms.
So that is around a 15x speedup, which doesn't look bad, but a colleague working with a similar routine (though one less tuned for the CPU) was getting around an 80x speedup. At the same time, the profiler (pgprof) tells me that occupancy is very low: 22.7% for riiparts_173_gpu, 4.9% for rii_matrix_252_gpu and 6.9% for rii_matrix_294_gpu, so I guess there is room for improvement.
But at the moment I'm mostly changing the gang, worker and vector numbers more or less at random, without a clear understanding of what I should be doing.
Can you give me some pointers to reading material (or similar) that would help me understand how to play with the distribution of threads on the GPU, so I can reach better occupancy and, hopefully, better performance? [If you actually play with the code and get better execution numbers, that earns you bonus points :-)]