Hi,

I’m new to GPU programming but very interested in speeding up our conjugate gradient solver. With a simple preconditioner the computational bottleneck should be the sparse matrix-vector multiplication, so I’ve been experimenting with getting the CRS matrix-vector product implemented efficiently on the GPU. With the code below I find that one CPU core is almost 5 times faster than the GPU (i7 960 vs. GTS 450, with about 1 million nonzeros in A). I was hoping that by performing many SpMV operations with the same matrix on different vectors I would see a significant speedup. Any suggestions would be greatly appreciated.

Thanks

nitt = 1000

!$acc data region local(t,y)
do j = 1, nitt              ! iterations
!$acc region
  do i = 1, nrow            ! SpMV product y = A*x
    t = 0.0
    do k = ia(i), ia(i+1)-1
      t = t + a(k)*x(ja(k))
    enddo
    y(i) = t
  enddo
  x = y                     ! To test, set x to something new each iteration
!$acc end region
enddo
!$acc end data region

Here is the output from compiling the code. The compiler has chosen to parallelize the loop over the rows with a vector length of 256.

pgfortran -o f1.exe f1.f90 -ta=nvidia -Minfo=accel -fast

NOTE: your trial license will expire in 7 days, 5.63 hours.

main:
    47, Generating local(y(:))
        Generating local(t)
    49, Generating copyin(ja(:))
        Generating copyin(x(:))
        Generating copyout(x(1:111872))
        Generating copyin(a(:))
        Generating copyin(ia(1:111873))
        Generating compute capability 1.3 binary
    50, Loop is parallelizable
        Accelerator kernel generated
        50, !$acc do parallel, vector(256)
            Cached references to size [257] block of 'ia'
            CC 1.3 : 16 registers; 1052 shared, 132 constant, 0 local memory bytes; 100% occupancy
    52, Loop is parallelizable
    57, Loop is parallelizable
        Accelerator kernel generated
        57, !$acc do parallel, vector(256)
            CC 1.3 : 4 registers; 20 shared, 112 constant, 0 local memory bytes; 100% occupancy