Hi again
I am a student trying to learn more about GPUs, and I have a few questions about the following code:
    !$acc region
    do k = 1, n1
      do i = 1, n3
        y = 0
        do j = 1, n2
          y = y + a(i,j) * b(j,k)
        enddo
        c(i,k) = y
      enddo
    enddo
    !$acc end region
This code comes from the matrix multiplication sample provided by PGI. I have tried running it, but the innermost loop does not seem to be parallelized. If possible, could someone help me parallelize all of the loops? The compiler messages I receive are:
    37, Loop is parallelizable
    38, Loop is parallelizable
    Accelerator kernel generated
    37, !$acc do parallel, vector(16)
    38, !$acc do parallel, vector(16)
    CC 1.0 : 12 registers; 24 shared, 64 constant, 0 local memory bytes; 66 occupancy
    CC 1.3 : 12 registers; 24 shared, 64 constant, 0 local memory bytes; 100 occupancy
    41, Loop is parallelizable
    57, Loop interchange produces reordered loop nest: 57,59,58
In case you are wondering why I rewrote this code, the original was:
    !$acc region
    do k = 1,n1
      do i = 1,n3
        c(i,k) = 0.0
        do j = 1,n2
          c(i,k) = c(i,k) + a(i,j) * b(j,k)
        enddo
      enddo
    enddo
    !$acc end region
The reason is that when I tried to compile the original code, I received the following messages:
    60, Complex loop carried dependence of 'c' prevents parallelization
    Loop carried reuse of 'c' prevents parallelization
    Inner sequential loop scheduled on accelerator
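For anyone who wants to reproduce this, here is a minimal self-contained sketch of the rewritten loop nest with the scalar y. The program name, the matrix sizes, and the fill values are assumptions made for illustration and are not part of the PGI sample. The inner j loop is a sum reduction, so each iteration depends on the one before it. Accumulating into y means c(i,k) is written exactly once per (i,k) pair instead of being read and rewritten on every j iteration. The result is checked on the host against the matmul intrinsic.

```fortran
! Hypothetical reproducer: the program name, sizes, and fill values are
! assumptions for illustration, not taken from the PGI sample.
program matmul_check
  implicit none
  integer, parameter :: n1 = 64, n2 = 48, n3 = 32
  real :: a(n3,n2), b(n2,n1), c(n3,n1), ref(n3,n1)
  real :: y
  integer :: i, j, k

  ! Fill the inputs with small whole numbers so every product and sum
  ! is exactly representable in single precision.
  do j = 1, n2
    do i = 1, n3
      a(i,j) = real(mod(i + j, 7))
    enddo
  enddo
  do k = 1, n1
    do j = 1, n2
      b(j,k) = real(mod(j * k, 5))
    enddo
  enddo

  ! Rewritten loop nest: the scalar y carries the running sum over j,
  ! and c(i,k) is stored once after the inner loop finishes.
  !$acc region
  do k = 1, n1
    do i = 1, n3
      y = 0.0
      do j = 1, n2
        y = y + a(i,j) * b(j,k)
      enddo
      c(i,k) = y
    enddo
  enddo
  !$acc end region

  ! Host-side reference computed with the matmul intrinsic.
  ref = matmul(a, b)
  print *, 'max abs difference:', maxval(abs(c - ref))
end program matmul_check
```

A compiler without accelerator support treats the !$acc lines as ordinary comments, so the same file also builds and runs on the host, which gives a quick way to check the arithmetic independently of the GPU.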
(On a side note, the compiler would not accept variables named x and m in the loops, for reasons I could not determine.) Please let me know if anyone has come across these messages.
Thank you for your time!
-Chris