Nesting a GPU loop inside a CPU loop?

Hello all!

So my previous question about the GPU compiler was a success! Thanks again for the help on that. Unfortunately, I’m still getting errors in the precision (i.e., I should wind up with an error of 16% but end up with 25%, so a plain mismatching of results).

I think the issue has to do with a variable not being updated properly, or being trampled. Now, when I run on Multicore OpenMP I have no issues, but switching over to GPU causes the problems. It may just be that this is the nature of the beast, but I was wondering if it had to do with how the code I have is written. Unfortunately, I can’t post that code… at least not the main loops of it. I think I have the reductions right since it was behaving on Multicore.

What I was wondering was if it is possible to have an external CPU multicore loop with an internal GPU loop? Since I can’t post up too much, I’m basically just looking for whether it is possible to run the Multicore code that is working as-is, then just have a dot product call in the middle of it shunt onto the GPU, and what that command would look like? If possible, I should be able to work from there. I just wanted to ask before spending a week figuring out that it might not work haha!

Thanks again! I do love the nvfortran compiler, the multicore results are indeed better than the ifort ones! Good job!

Sure, you can have an outer OpenMP parallel loop with an inner OpenACC loop which is offloaded to the GPU. I normally don’t recommend doing this, especially for folks who want to use it for multi-GPU programming (MPI is better for multi-GPU since it’s easier to manage the different memories). Performance may be poor since I suspect you’ll need to be synchronizing data a lot, but for just trying to chase down numerical accuracy issues, it’s fine.
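
Roughly, it would be structured something like this (just a sketch with made-up names and sizes, not your code; you’d compile with both “-mp” and “-acc”):

program omp_outer_acc_inner
  implicit none
  integer, parameter :: n = 400, m = 100
  real(8) :: a(m,n), b(m), results(n), dsum
  integer :: i, k

  call random_number(a)
  call random_number(b)

  ! Outer loop split across the CPU threads; each thread offloads its
  ! dot product to the GPU as a separate kernel launch.
  !$omp parallel do private(k, dsum)
  do i = 1, n
     dsum = 0.0d0
     ! The copyin here is what usually makes this pattern slow:
     ! data moves between host and device on every outer iteration.
     !$acc parallel loop reduction(+:dsum) copyin(a(:,i), b)
     do k = 1, m
        dsum = dsum + a(k,i)*b(k)
     end do
     results(i) = dsum
  end do
  !$omp end parallel do

  print *, sum(results)
end program omp_outer_acc_inner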

Though, I’d first try using the flags “-Kieee -Mnofma”. -Kieee will use much stricter mathematical operations and -Mnofma will disable Fused Multiply-Add (FMA) operations. While FMA is actually more accurate, it can give divergent results if you’re using a CPU without FMA support.
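
For example (the file name here is just a placeholder):

% nvfortran -acc -Kieee -Mnofma -Minfo=accel mycode.F90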

Reductions are problematic, especially when reducing very small or large values. The accumulated rounding error of dozens of CPU threads versus tens of thousands of GPU threads can be quite different. Here, I’d try using the “num_gangs” and “vector_length” clauses to set these to small sizes (even 1 so it runs sequentially on the device). If the error goes away, then it’s likely caused by the reduction.
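
For example, something like this on the reduction loop (just a sketch; the loop bounds and names are placeholders):

! With one gang and a vector length of 1 the loop runs (nearly)
! sequentially on the device, so the summation order is much closer
! to a serial CPU run.
!$acc parallel loop num_gangs(1) vector_length(1) reduction(+:total)
do i = 1, n
   total = total + x(i)
end do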

Though with this large a difference, it could be something other than numerical precision. There could be race conditions in the code as well (such as forgetting to privatize an array), so be on the lookout for those. Or possibly some data isn’t getting synchronized between the host and device, so a garbage value is being used someplace.
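
The classic case of the first one is a scratch array that gets written inside the parallel loop; as a generic sketch (names are placeholders):

! Without private(tmp), all gangs share one copy of tmp and trample
! each other's values; with it, each iteration of "i" gets its own.
!$acc parallel loop private(tmp) reduction(+:total)
do i = 1, n
   do k = 1, m
      tmp(k) = x(k)*y(k,i)
   end do
   do k = 1, m
      total = total + tmp(k)
   end do
end do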

Thanks again for the help.

Sorry for the late reply, I’d been playing around with OpenACC, mostly trying to understand the basics of it. It seems more logical than using OpenMP for GPU stuff.

Reexamining my problem, I think I have to give a sample of what it is I’m trying to do so the issue I’m dealing with is clearer. I apologize that I haven’t done this sooner, but I suppose I didn’t understand the complexities of what I was trying to do (i.e., a dot product inside a loop with dependencies):

A = 400
SIZ = 100

DO 500 I = 1,A

TEMP = 0.0D0

DO K = 1,SIZ
SET1(K) = VEC(K)*MAT(K,VEC2(I))
END DO

DO K = 1,SIZ
TEMP = TEMP + SET1(K)
END DO

IF (TEMP.GT. 0.0D0) THEN
COUNT = COUNT + 1
ENDIF

500 CONTINUE

That’s the general idea; there are a few more “results from the previous A loop are added to the new result” sections in there, but those would be treated the same as the “count” variable. These are the only loops, at any rate, that govern execution speed.

I was able to get it to perform great when just using OpenMP for a CPU; nvfortran works better than ifort as far as I can tell (the compiler flags helped with that). But whenever I try applying similar rules and restrictions for the GPU, it either crashes, falls apart, or doesn’t even start the loop. More often, when it does run, it gives those major differences in results that seem to come from the main loop not iterating (i.e., it just repeats an out-of-loop initialization over and over), so at least I think I found the source of that issue. I tried setting the gangs and vector lengths too, but had the same problems. I’m sure they’d help if I got the code right, but I think my issue is a more general one I skipped over before…

The reason I was keen on doing a CPUxGPU loop was to have the CPU controlling that A loop and the GPU handling the Dot Product portion (maybe I have that backwards?). But either the code runs much slower than before or I get the other problems I mentioned. I guess I’m looking for help on which controls need to be set, since I’m sorta close to just staying CPU-only at this point (which would still be a “success”, just not as substantial).

This forum has been very helpful expanding my knowledge of these compilers and coding in general. I appreciate any additional help or suggestions!

I’d just offload the full “A” loop to the GPU. You can run hundreds of thousands of threads on a GPU, so offloading just the small loops with only a trip count of 100 would severely underutilize the GPU. Plus you have the overhead of launching 800 kernels, which individually isn’t much but can add up, especially with such small loops.

Try something like the following:

% cat test.F90

program test

integer :: A, SIZ, COUNT, I, K
integer, allocatable,dimension(:) :: VEC2
real, allocatable,dimension(:) :: VEC,SET1
real, allocatable,dimension(:,:) :: MAT
real :: TEMP

A = 400
SIZ = 100

allocate(SET1(SIZ),VEC(SIZ),VEC2(A))
allocate(MAT(SIZ,A))

!$acc enter data create(VEC,VEC2,MAT)

!$acc kernels loop
DO I = 1,A
  VEC2(I) = I
END DO
!$acc kernels loop
DO K = 1,SIZ
   VEC(K) = REAL(K)
ENDDO
!$acc kernels loop
DO I = 1,A
  DO K = 1,SIZ
     MAT(K,I) = REAL(K)
  enddo
enddo

count = 0

!$acc kernels loop private(SET1) reduction(+:count)
DO 500 I = 1,A

TEMP = 0.0D0

!$acc loop vector
DO K = 1,SIZ
SET1(K) = VEC(K)*MAT(K,VEC2(I))
END DO

!$acc loop vector reduction(+:TEMP)
DO K = 1,SIZ
TEMP = TEMP + SET1(K)
END DO

IF (TEMP.GT. 0.0D0) THEN
COUNT = COUNT + 1
ENDIF

500 CONTINUE

print *, count
!$acc exit data delete(VEC,VEC2,MAT)
deallocate(SET1)
deallocate(VEC)
deallocate(VEC2)
deallocate(MAT)

end program test

% nvfortran test.F90 -acc -Minfo=accel ; a.out
test:
     16, Generating enter data create(vec(:),vec2(:),mat(:,:))
     18, Generating implicit copyout(vec2(1:400)) [if not already present]
     19, Loop is parallelizable
         Generating Tesla code
         19, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     22, Generating implicit copyout(vec(1:100)) [if not already present]
     23, Loop is parallelizable
         Generating Tesla code
         23, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     26, Generating implicit copyout(mat(1:100,1:400)) [if not already present]
     27, Loop is parallelizable
     28, Loop is parallelizable
         Generating Tesla code
         27, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         28,   ! blockidx%x threadidx%x auto-collapsed
     35, Generating implicit copyin(mat(1:100,:)) [if not already present]
         Generating implicit copy(count) [if not already present]
         Generating implicit copyin(vec2(1:400),vec(1:100)) [if not already present]
     36, Loop is parallelizable
         Generating Tesla code
         36, !$acc loop gang ! blockidx%x
             Generating reduction(+:count)
         41, !$acc loop vector(128) ! threadidx%x
         46, !$acc loop vector(128) ! threadidx%x
             Generating reduction(+:temp)
     41, Loop is parallelizable
     46, Loop is parallelizable
     57, Generating exit data delete(vec2(:),vec(:),mat(:,:))
          400

Thanks so much for the quick reply! I’ll try that out, definitely not what I was doing before! Thanks again, I’ll report back if it works.

I was able to compile and got the same results that your example gave. I’m having to play around with some of the variables being initialized though, since I’m getting an error in the terminal output of arrays already being allocated (0: ALLOCATE: array already allocated), but that is probably just something I need to work out (this is ultimately a subroutine that is pulling in variables from a large number of other places). That said, I’m further than I was before by a long shot thanks to your help. Thanks!

I think I’m getting closer, but I’m having what are probably just some variable declaration problems. Since some of these variables are not generated in this particular subroutine, would I have to change anything with the “enter data create” variables? If the VEC2 and MAT variables are generated in a separate subroutine, then brought in externally to this subroutine, would I still have them in the “enter data create” line? Thanks!

If the VEC2 and MAT variables are generated in a separate subroutine, then brought in externally to this subroutine, would I still have them in the “enter data create” line?

The scope and lifetime of device data used within an unstructured data region will be the execution of the program between the “enter” and “exit” directives. This includes crossing subroutine boundaries.

Basically, the runtime keeps track of device variables in a “present” table which has the host address, associated device address, and size of the variable. When entering a nested data region (a compute region has an implicit data region), the runtime checks if the variable is “present” and if so uses the device copy of the variable. If the variable is not present, then the runtime will implicitly create and copy the data.
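
To illustrate (the routine names here are made up): data created in one subroutine is found by the present check in another:

subroutine setup(vec, n)
  integer :: n
  real(8) :: vec(n)
  ! The device copy created here stays alive until a matching
  ! "exit data", even after this subroutine returns.
  !$acc enter data copyin(vec)
end subroutine setup

subroutine compute(vec, n, total)
  integer :: n, k
  real(8) :: vec(n), total
  total = 0.0d0
  ! The runtime finds vec in the present table and reuses the device
  ! copy made in setup; no implicit copy happens here.
  !$acc parallel loop present(vec) reduction(+:total)
  do k = 1, n
     total = total + vec(k)
  end do
end subroutine compute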

When porting new codes to use OpenACC, I generally start with a top-down approach for data and a bottom-up approach for compute. Meaning, I add “enter data create” directives just after the arrays are allocated (or declared, if fixed size). Then I iteratively add compute directives in the solvers, using “update” directives before and after the compute regions to synchronize the host and device copies of the variables. As I add more compute regions, I can move the “update” directives outwards until all the compute is offloaded and little or no data movement is needed.

I’ll also tend to add “default(present)” or “present(…list of variables)” on the compute regions. The “present” clause will cause the program to error at runtime if the variable is not actually present, and thus eliminates any surprise implicit data copies.
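
As a rough sketch of that workflow (the routine and variable names here are just placeholders):

allocate(x(n), y(n))
!$acc enter data create(x, y)      ! top-down: create right after the allocate

call init_on_host(x, y, n)         ! still host-only code at this point
!$acc update device(x, y)          ! sync host -> device before the offloaded loop

! bottom-up: offload one solver loop at a time; default(present) makes
! the run fail loudly if something was left out of a data region
!$acc parallel loop default(present)
do i = 1, n
   y(i) = 2.0d0*x(i) + y(i)
end do

!$acc update self(y)               ! sync device -> host for code still on the CPU
call use_on_host(y, n)

!$acc exit data delete(x, y)
deallocate(x, y)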

Of course, I’m leaving out a lot of detail, so let me know if you have additional questions. For full details, I suggest you take a look at section 2.6 of the OpenACC spec (https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.1-final.pdf).

I’m getting closer, I promise!

Your advice helped; I think I had to include a few more variables in the data create declaration for it to work. The problem I’m having now is that, for some reason, when this subroutine is called numerous times it seems to be “stacking” the results on top of each other in the “count” variables (there are a few counters). For example, I run the subroutine, it counts the number of cycles where it gets a value higher than “x”, but then the subsequent call stacks the new “count” on top of the old one.

The idea is that the subroutine counts, finds a response constant, loads that constant back out to the main script, and is then rerun to optimize that constant. I’m not sure why it’s adding up the “count” variables, but I’m certain there is something dumb I’m forgetting or ignoring.

Any advice? With respect to the code you posted a while back, there is a section after the deallocation that uses those counts to compute that constant, if that helps. There is a “return/end” statement at the end.

The good news is that, internally at least, the values I’m getting from the Dot Product are sensible at last! The first call to the subroutine finally returns an answer that makes sense. I know I keep coming back with new problems, but this has really been helpful and it feels like I’m learning how this thing works. Thanks again!

Off-hand, the only thing I can think of that would cause this is if you’re not setting “count=0” before the reduction.

count = 0
!$acc kernels loop private(SET1) reduction(+:count)
DO 500 I = 1,A

If you can post an example snippet showing how “count” is declared (is it a local variable? passed in as an argument? a module/common block variable?) and how it’s used, that may be helpful for understanding what’s happening.

Thanks! That was probably the problem…

The “count” variables were declared like this, and I’ve since moved the count data directive below the variables themselves as you showed:

! variables that are common across multiple subroutines are initialized
! VEC2 and MAT are brought in from outside the subroutine
! below are the "count" variables
!$acc enter data create(CORR1,CORR2,SUB1CT,SUB2CT,DSUM,SCORE)
SEPARATE = .FALSE.
SUB1CT = 0
SUB2CT = 0
CORR1 = 0
CORR2 = 0
DSUM = 0.0D0
!$acc enter data create(VEC,VEC2,MAT)
!$acc kernels loop
DO 100 K=1,SIZ
VEC(K) = REAL(K)
100 CONTINUE
CALL DVNORM(VEC,SIZ)
!$acc update device(VEC)
! another vector is filled out here, but it doesn't change from iteration to iteration
DO 500 I = 1,A
! Dot Product Code
500 CONTINUE
!$acc update self(DSUM,CORR1,CORR2,SUB1CT,SUB2CT)
! use Dot Product/count values to modify the value that goes back out to the main script
! exit data delete for all of the created variables above
RETURN
END

I’m guessing that the data create needs to come after the variables are initialized? While messing around with the code you had posted before, I rearranged some things, since I have some variables that are modified and declared based on previous iterations of this code (thus the hopefully correct use of the update).

I apologize, I know I’m forcing you to work without a full deck of cards. It’s definitely improving though (both the code and my level of understanding, even if I’m still a novice at this acc stuff); I’m not getting totally nonsensical results anymore.

No, not necessary. “create” just allocates the data on the device. Now if you need the device data initialized, then you’ll want to add an “update device” after you initialize on the host to ensure the data is in sync.

Alternatively, you can use an “enter data copyin” directive after you initialize the data on the host, in which case the device data is both created and updated. I tend to use create plus updates since it’s more explicit and makes it easier to implement the top-down approach I described earlier. But “copyin” works fine too.
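
Using your counters as an example, either of these gives the device an initialized copy (pick one, not both):

! Option 1: create, initialize on the host, then sync to the device
!$acc enter data create(CORR1, CORR2)
CORR1 = 0
CORR2 = 0
!$acc update device(CORR1, CORR2)

! Option 2: initialize on the host first, then create-and-copy in one step
CORR1 = 0
CORR2 = 0
!$acc enter data copyin(CORR1, CORR2)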

Now since these are scalars, you don’t really need to put them into a data region at all. The default for scalars is to make them private or firstprivate so they become local variables rather than needing to be fetched from global memory.

Reduction variables are a special case in that the compiler will create a local variable for the partial reduction and then create a second kernel that does the final reduction. If not in a data region, the result is then implicitly copied back and added to the host variable. With the reduction variable in a data region, the result is added to the device copy (not copied back to the host). Personally, the only time I put reduction variables in a data region is when I want to use “async”, since without async the kernel would block waiting for the reduction variable to be copied back. Otherwise, it’s easier to just let the compiler manage these for you.
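
For example, roughly (just a sketch; the names are placeholders):

total = 0.0d0
!$acc enter data copyin(total)
! With total present on the device, the reduction result is added to
! the device copy and the kernel does not block to copy it back.
!$acc parallel loop reduction(+:total) async(1)
do k = 1, n
   total = total + x(k)
end do
! ... other host work can overlap here ...
!$acc update self(total) async(1)
!$acc wait(1)
!$acc exit data delete(total)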

Now this bit of code looks problematic to me:

!$acc enter data create(VEC,VEC2,MAT)
!$acc kernels loop
DO 100 K=1,SIZ
VEC(K) = REAL(K)
100 CONTINUE
CALL DVNORM(VEC,SIZ)
!$acc update device(VEC)

You create VEC on the device and then initialize it in a device compute region. That’s fine, but what’s happening in DVNORM? If this is purely on the host, then it’s using the host copy of VEC, which is not initialized. If DVNORM is also using the device, then the “update device(VEC)” will update the device with the uninitialized values in the host copy of VEC. Any value set in DVNORM would be wiped out.

Assuming DVNORM is host only, I’d not offload loop 100 since it’s just setting the initial value. Although you could, it’s probably more expensive to also put an “update self(VEC)” after the loop (which is the other fix). Granted, if you are planning on offloading DVNORM later, then you might want to add the “update self” as good porting practice. Just remove it and the “update device” after you offload DVNORM.

If DVNORM is already on the device, then remove the update device so you don’t clobber the device copy.
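
For the host-only case, that would look roughly like this:

!$acc enter data create(VEC,VEC2,MAT)
! Initialize and normalize VEC on the host; the loop is small, so
! offloading it buys little.
DO 100 K = 1,SIZ
VEC(K) = REAL(K)
100 CONTINUE
CALL DVNORM(VEC,SIZ)
! Now push the normalized host values to the device copy.
!$acc update device(VEC)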

Understanding and managing two discrete memories is one of the more challenging aspects of using GPUs. Although there is an association between the host and device copies of a variable, they are separate and distinct.

I know I’m forcing you to work without a full deck of cards.

That’s normal for me ;-)