Hi,
I’m new to GPU programming but very interested in speeding up our conjugate gradient solver. With a simple preconditioner the computational bottleneck should be the sparse matrix–vector multiplication (SpMV), so I’ve been experimenting with getting CRS (compressed row storage) matrix–vector products efficiently implemented on the GPU. With the following code I find that one CPU core is almost 5 times faster than the GPU (i7 960 vs. GTS 450, with about 1 million nonzeros in A). I was hoping that performing many SpMV operations with the same matrix on different vectors would show a significant speedup. Any suggestions would be greatly appreciated.
Thanks
nitt = 1000
!$acc data region local(t,y)
do j = 1, nitt                  ! iterations
  !$acc region
  do i = 1, nrow                ! SpMV: y = A*x in CRS format
    t = 0
    do k = ia(i), ia(i+1)-1
      t = t + a(k)*x(ja(k))
    enddo
    y(i) = t
  enddo
  x = y                         ! to test, set x to something new each iteration
  !$acc end region
enddo
!$acc end data region
Here is the output from compiling the code; the compiler has chosen to vectorize the loop over the rows.
pgfortran -o f1.exe f1.f90 -ta=nvidia -Minfo=accel -fast
NOTE: your trial license will expire in 7 days, 5.63 hours.
main:
47, Generating local(y(:))
Generating local(t)
49, Generating copyin(ja(:))
Generating copyin(x(:))
Generating copyout(x(1:111872))
Generating copyin(a(:))
Generating copyin(ia(1:111873))
Generating compute capability 1.3 binary
50, Loop is parallelizable
Accelerator kernel generated
50, !$acc do parallel, vector(256)
Cached references to size [257] block of ‘ia’
CC 1.3 : 16 registers; 1052 shared, 132 constant, 0 local memory bytes; 100% occupancy
52, Loop is parallelizable
57, Loop is parallelizable
Accelerator kernel generated
57, !$acc do parallel, vector(256)
CC 1.3 : 4 registers; 20 shared, 112 constant, 0 local memory bytes; 100% occupancy