Sparse Matrix Vector Multiplication

Hi,

I’m new to gpu programming but am very interested in speeding up our conjugate gradient algorithm. With a simple preconditioner the computational bottleneck should be the sparse matrix vector multiplication. I’ve been experimenting in order to get the CRS matrix vector products efficiently implemented on the gpu. When I use the following code I find that one core on the cpu is almost 5 times faster then on the gpu (i7 960 vs gts 450 with about 1 million nnz in A). I was hoping that when I perform many spMV operations with the same matrix on different vectors I could see a significant speedup. Any suggestions would be greatly appreciated.

Thanks

nitt=1000

!\$acc data region local(t,y)
do j=1,nitt ! itterations

!\$acc region
do i = 1, nrow ! SpMV products
t=0
do k = ia(i), ia(i+1)-1
t=t+a(k)*x(ja(k))
enddo
y(i)=t
enddo
x=y ! To test set x to be something new each iteration

!\$acc end region
enddo
!\$acc end data region

Here is the output from compiling the code. It has chosen to vectorize the loop over the rows.

pgfortran -o f1.exe f1.f90 -ta=nvidia -Minfo=accel -fast
main:
47, Generating local(y(:))
Generating local(t)
49, Generating copyin(ja(:))
Generating copyin(x(:))
Generating copyout(x(1:111872))
Generating copyin(a(:))
Generating copyin(ia(1:111873))
Generating compute capability 1.3 binary
50, Loop is parallelizable
Accelerator kernel generated
50, !\$acc do parallel, vector(256)
Cached references to size [257] block of ‘ia’
CC 1.3 : 16 registers; 1052 shared, 132 constant, 0 local memory by
tes; 100 occupancy
52, Loop is parallelizable
57, Loop is parallelizable
Accelerator kernel generated
57, !\$acc do parallel, vector(256)
CC 1.3 : 4 registers; 20 shared, 112 constant, 0 local memory bytes
; 100 occupancy

Hi elliotmh,

One thing that will help would be to add a “copyin(ia,ja,a)” and “copy(x)” to your data region. This should help reduce the amount of data copied. Right now you’re copying “a”, “ja”, and “ia” in and copying “x” in and out 1000 times.

The compiler is able to recognize reductions and generate an optimized parallel reduction kernel. However, since one GPU kernel can’t be launched from within another, the parallel reduction can’t be generated here. This is why only the i loop is being accelerated.

A secondary issue is that loops must be rectangular in shape in order to be accelerated (i.e. the loop bounds must be know before entering a region). Since the k-loop’s bounds is determined within the body, this loop is not able to be accelerated.

Another issue is your memory accesses of “a”, “ja” and “x”. Ideally you want to access memory using the vector loop index (which is “i” in this case). Otherwise you get memory divergence (See: http://www.pgroup.com/lit/articles/insider/v2n1a5.htm)

Finally, your compute intensity is very low. Compute intensity is the ratio of computation over data movement. With only one multiple and and one add and no data reuse, the code may not be suitable for a GPU.

How big is nrow (the i-loop trip count) and the average size of ia(i) to ia(i+1) (the k-loop trip count)?

If the i-loop is large and k-loop is small, then keep the code as is and just fix the memory copys in the data region.

if the i-loop is small and k-loop is large, move the acc region around the k-loop only. This will help with memory divergence and may give you enough parallelization to overcome the low compute intensity.

If both the i and j loops are large, try recoding to some thing like the following:

``````real, dimension(:,:), allocatable :: temp
allocate(temp(nrow,maxval))
....
nitt=1000

!\$acc data region local(t,y),copy(x), copyin(ia,ja,a)
do j=1,nitt ! itterations

!\$acc region
temp=0
do i = 1, nrow ! SpMV products
do k = 1, maxval
if (k .ge. ia(i) .and. k .lt. ia(i)) then
temp(i,j)=a(k)*x(ja(k))
endif
enddo
enddo

do i = 1, nrow ! SpMV products
y(i)=0
do k = 1, maxval
y(i) = y(i) + temp(i,j)
enddo
enddo

x=y ! To test set x to be something new each iteration

!\$acc end region
enddo
!\$acc end data region
``````

Hope this helps,
Mat