Strangely long loop execution time

Hello everybody,

I’ve got a procedure that I want to partially execute on the GPU. I’ve already successfully ported some loops, taking care of copying the essential arrays.
One loop, after porting to the GPU, takes approximately 25 times longer than when executed on the CPU. The CPU code looks like this:

        do kk=1,igfy
          hmatrix(kk,igfy)=zero
          do nbl=1,nblcks
            do n=ijklim(nbl,1),nijkpr(nbl)
              hmatrix(kk,igfy)=hmatrix(kk,igfy)+
     &              vvect(ijk,igfyp1)*vvect(ijk,kk)
            enddo
          enddo
          do nbl=1,nblcks
            do n=ijklim(nbl,1),nijkpr(nbl)
              vvect(ijk,igfyp1)=vvect(ijk,igfyp1)
     &                    -hmatrix(kk,igfy)*vvect(ijk,kk)
            enddo
          enddo
        enddo

To make it executable on the GPU, I changed the code like this:

(630) !$acc region local(kk)
(631)       do kk=1,igfy
(632)         hmatrix(kk,igfy)=zero
(633)       enddo
(635)       do nbl=1,nblcks
(636)         do kk=1,igfy
(637)           do ijk=imoj4,imoj5
(638)             hmatrix(kk,igfy)=hmatrix(kk,igfy)+
(639)  &              vvect(ijk,igfyp1)*vvect(ijk,kk)
(640)           enddo
(641)         enddo
(642)       enddo
(644)       do nbl=1,nblcks
(645)         do ijk=imoj4,imoj5
(646)           do kk=1,igfy
(647)             vvect(ijk,igfyp1)=vvect(ijk,igfyp1)
(648)  &                    -hmatrix(kk,igfy)*vvect(ijk,kk)
(649)           enddo
(650)         enddo
(651)       enddo
(652) !$acc end region

Compilation log:

630: region entered 3990 times
        time(us): total=74000000
                  kernels=36032074 data=49510
        631: kernel launched 3990 times
            grid: [1]  block: [256]
            time(us): total=28628 max=140 min=4 avg=7
        637: kernel launched 3990 times
            grid: [1]  block: [32]
            time(us): total=35830347 max=11623 min=6040 avg=8980
        646: kernel launched 3990 times
            grid: [72]  block: [256]
            time(us): total=173099 max=131 min=15 avg=43

I’ve also tried switching lines 636 and 637:

         do ijk=imoj4,imoj5
           do kk=1,igfy

but with the same results.

Why can the loop on line 637 take ~200 times longer than the loop on line 646? Any ideas?


Hi szczelba,

Notice the actual schedule used for each kernel in the profile output. The loop at line 637 uses only a single block with 32 threads. This is a very poor schedule, since you’re using only a very small portion of your GPU. The loop at line 646 uses 72 blocks of 256 threads each. This is much better, and it shows in the performance.
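As a rough sanity check (a sketch only, using the grid and block sizes from the profile output above), the total number of threads each kernel launch uses is simply grid × block:

```python
# Total threads a kernel launch uses = grid size * block size.
# The grid and block numbers are taken from the profile output above.

def threads(grid, block):
    """Total number of GPU threads for a given launch configuration."""
    return grid * block

kernel_637 = threads(grid=1, block=32)     # the slow kernel
kernel_646 = threads(grid=72, block=256)   # the fast kernel

print(kernel_637)   # 32: a single warp for the whole GPU
print(kernel_646)   # 18432: enough threads to fill many multiprocessors
```

With only 32 threads in flight, the 637 kernel leaves almost the entire device idle.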

Can you please post the output when you compile with “-Minfo=accel”?


Sorry for the long delay; I’m back on the topic.
I compiled the program with -Minfo=accel, but no additional accelerator information showed up. So I still get:

on compilation:

    635, Parallelization would require privatization of array 'hmatrix(1:igfy,igfy)'
    636, Loop carried dependence due to exposed use of 'hmatrix(1:igfy,igfy)' prevents parallelization
    637, Loop is parallelizable
         Accelerator kernel generated
        635, !$acc do seq
             Non-stride-1 accesses for array 'vvect'
        636, !$acc do seq
             Cached references to size [32] block of 'vvect'
        637, !$acc do parallel, vector(32)
             Using register for 'hmatrix'
             CC 1.3 : 18 registers; 276 shared, 188 constant, 0 local memory bytes; 25 occupancy

after execution:

630: region entered 4080 times
        time(us): total=6000000
                  kernels=1606521 data=49558
        631: kernel launched 4080 times
            grid: [1]  block: [256]
            time(us): total=24870 max=118 min=5 avg=6
        637: kernel launched 4080 times
            grid: [1]  block: [32]
            time(us): total=1548237 max=533 min=275 avg=379
        646: kernel launched 4080 times
            grid: [4]  block: [256]
            time(us): total=33414 max=90 min=4 avg=8

I still don’t have any clue how to fix it. Adding “!$acc do parallel, vector(256)” does change the vector size to 256, but the calculations do not speed up at all.

Help, please. I’m under time pressure.

Hi szczelba,

The informational messages indicate that the compiler reordered the loops so that 637 is the outermost, while 635 and 636 are executed sequentially within the generated kernel. It’s a reasonable strategy: it allows the code to be parallelized and takes advantage of shared memory. The caveat is when 637’s trip count is small.
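To make the two schedules concrete, here is a small Python sketch of both loop nests (0-based indexing; n and igfy are illustrative stand-ins for the real trip counts, and vvect/hmatrix are plain lists rather than the actual arrays):

```python
# Sketch of the two loop nests from the accelerated region, in Python.
# n and igfy are small stand-ins; the point is the dependence structure.
n, igfy = 8, 3          # stand-ins for imoj5-imoj4+1 and igfy
igfyp1 = igfy           # 0-based index of the vvect column being updated

vvect = [[float(i + j + 1) for j in range(igfy + 1)] for i in range(n)]
hmatrix = [0.0] * igfy

# First nest (kernel at "637"): each kk needs a full reduction over ijk.
# Only igfy independent results exist, so the parallel dimension is tiny.
for kk in range(igfy):                      # parallelizable, but short
    for ijk in range(n):                    # sequential reduction
        hmatrix[kk] += vvect[ijk][igfyp1] * vvect[ijk][kk]

# Second nest (kernel at "646"): every ijk iteration is independent,
# so the compiler can spread all n iterations across the GPU.
for ijk in range(n):                        # fully parallelizable
    for kk in range(igfy):
        vvect[ijk][igfyp1] -= hmatrix[kk] * vvect[ijk][kk]
```

The first nest is a set of igfy dot products, i.e. a reduction; the second updates each vvect(ijk, igfyp1) element independently, which is why the compiler finds a much better schedule for it.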

What are the values for ‘igfy’, ‘nblcks’, ‘imoj4’, and ‘imoj5’?

  • Mat


igfy = 10
nblcks = 1

OK, so the compiler treated the loop on 637 as the outermost? That loop was the outermost before, but then it was impossible to parallelize, so I switched the loops and put the 637 loop inside.
What about the loop on line 644? It seems similar, but it works fine.
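For what it’s worth, the standard way GPUs parallelize a reduction like the 635–637 nest is to accumulate independent partial sums over chunks of the ijk range and then combine them. A minimal Python sketch of that pattern (chunk count and data are made up for illustration):

```python
# Chunked-reduction pattern: independent partial sums (parallelizable),
# then a small final combine.  Data and chunk count are illustrative.

def dot_chunked(x, y, n_chunks):
    """Dot product computed via independent per-chunk partial sums."""
    n = len(x)
    step = (n + n_chunks - 1) // n_chunks          # chunk length
    partials = [
        sum(x[i] * y[i] for i in range(c * step, min((c + 1) * step, n)))
        for c in range(n_chunks)                   # chunks are independent
    ]
    return sum(partials)                           # cheap final combine

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0] * 8
print(dot_chunked(x, y, 4))   # 36.0
```

The 644 nest needs no such trick, because every ijk iteration writes a different element.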

Hi szczelba,

Are you able to send me the code? I think it would be more beneficial to have the full source than to try to determine the cause with limited information.