How to compile existing C/C++ project w/ NVIDIA GPU?

LET · July 27, 2012, 1:58pm

Hello,

I am hoping for some advice/syntax on how to successfully compile a hybrid project with C/C++ code whose pragma OpenMP statements have been replaced with pragma acc statements. I am used to GUI-based Visual Studio programming, and having a hard time to getting it to compile from the command line. If modifications need to be done – which I suspect (isolating what I need to compile w/ the pragmas) – I’d appreciate knowing the syntax for compiling that part separately and linking it back with the main source, ideally making as few changes as possible to the original code.

(Note, if the below is too terse, I’d be glad to send along the code to the PGI team to help in getting it to compile – please let me know)

I am using Windows 7 x64, w/ v12.6 of PGI Accelerator.

The structure of the code is as follows:

MAIN.cpp – main entry point, creates instance of MAINclass and calls initApp, a subclass of MAINclass, located in MAINclass.cpp – includes: MAIN.h, MAINclass.h

MAINclass.cpp – Contains 2 functions, both subclasses of MAINclass. One is a helper function (complex product calculation) and initApp(), which calls ReadData() in ReadData.h and then calls initialization() located in initialization.h. includes: MAIN.h, MAINclass.h, ReadData.h, calcJ.h, initialization.h, WriteData.h.

MAIN.h – just includes standard headers used in app – stdlib.h, stdio.h, math.h, string.h, time.h. No other statements.

MAINclass.h – defines constants (eg, PI) defines public and private class functions and variables used throughout program. – includes: structures.h, which is a Complex number struct used throughout program. Also contains a #define MAINclass_h statement.

structures.h – just contains a Complex struct definition. no includes, no other code.

initialization.h – subclass of MAINclass - creates more 1D/2D arrays needed using C++'s new construct and calls calcJ(Jx,Jy,Jz) located on calcJ.h. Also calls WriteData() located in WriteData.h. no includes.

WriteData.h – subclass of MAINclass, otherwise straight C code to write data. no includes

calcJ.h – subclass of MAINclass, and part where pragma statements reside, uses some new C++ calls to init some local arrays, but otherwise pure C code. no includes

ReadData.h – subclass of MAINclass, mixed c/c++ code to read data and init arrays to hold data. no includes

MatColgrove · July 27, 2012, 3:17pm

Hi vacaloca,

We’re happy to help, but we need a bit more detail about what “having a hard time to getting it to compile” means. Do you mean that you need to understand the basics of using the compiler from a command line and what flags to use? How to set up a command-line build environment using tools such as ‘make’? Is there an error when you compile? Do you need advice on using OpenACC?

For basic usage questions, you should start by reading chapters 1 and 2 of the PGI User’s Guide. Though, please ask if anything doesn’t make sense or needs to be clarified.

For compilation issues, please post the error you are getting and, if possible, a small example code that causes the problem. If the files are too big to post, then please send them to PGI Customer Support (trs@pgroup.com) and ask them to forward the files to me (Mat).

As for OpenACC, again it’s too big of a topic for this post so I’ll refer you to our documentation. Granted, OpenACC is very new so most of our training material is for the PGI Accelerator Model. Though, OpenACC is in large part based upon the PGI Accelerator Model, so that material is still relevant. I would advise you to start here: PGI Compilers with OpenACC | PGI. If you don’t mind looking at Fortran, I wrote an article which walks through the porting process and might be helpful (Account Login | PGI).

One thing you may be interested in is that we do have a Visual Studio 2008 Plugin for C. It’s still very much a beta product so we’re looking for users that are willing to try it out and give us feedback. Let me know and I can get you in contact with the right person.

Best Regards,
Mat

LET · July 27, 2012, 4:18pm

Yes, sorry about being too vague. Basically yes, I’m not too familiar w/ *nix makefiles so I figured the Windows version might be a better choice, but given that I have a mix of C/C++ code, it seems that the compiler wants a pure C file when I compile using the -ta=nvidia,cc20 flag.

So I guess I just need the right syntax to get the project to compile correctly using the PGI Accelerator compiler.

My first try was just the most basic… point it to my main entry file as suggested by the documentation:

pgcc MAIN.cpp -ta=nvidia,cc20 -Minline

which gives:
‘Please use pgcpp for C++ source files’

So I attempted that, but then am met with the fact that pgcpp does not support compiling with the -ta flag, which I think pretty much is what I want in the first place – to compile the code to be run on the GPU.

If I just rename my .cpp files to .c files, and do:
pgcc MAIN.c -ta=nvidia,cc20 -Minline

I just get a barrage of errors related to ignoring/recovering from :'s, presumably because these are really C++ functions and pgcc does not seem to know how to deal with them.

I was actually trying to search for the Visual Studio plugin I saw referenced on your site, but did not find it. I currently have VS2010, but no real reason why I can’t install 2008 to test your plugin, so I’d appreciate to try that out and see if it works out for me.

I have just sent the (12:26 Eastern) code to the Customer Support e-mail so you can see better what I’m working with and advise accordingly. The only change I’ve done from the original OpenMP version is replacing the pragma openmp to pragma acc statements so far. It still compiles in VS2010, despite ignoring the unknown pragmas.

MatColgrove · July 27, 2012, 4:30pm

Currently, we only support OpenACC/PGI Accelerator Model with C and Fortran. Though, we should have C++ support next year. In the meantime, you will need to extract the accelerator code from the C++ portion and put it into a C routine.

I was actually trying to search for the Visual Studio plugin I saw referenced on your site, but did not find it. I currently have VS2010, but no real reason why I can’t install 2008 to test your plugin, so I’d appreciate to try that out and see if it works out for me.

We don’t advertise that we have it since we’re still gauging if there is a demand for it. I’ll ask that you be contacted and get you a copy. Note that it is currently C only so you would have to port your code.

I have just sent the code to the Customer Support e-mail so you can see better what I’m working with

Got it. I’ll look into it.

Mat

LET · July 27, 2012, 4:59pm

Thanks!

MatColgrove · July 27, 2012, 6:06pm

Hi Luis,

In looking at your code, if you were able to rewrite the “calc_d_Ap_j” routine in C, then the loops shouldn’t be too difficult to accelerate. The hardest part looks to be passing in the class data members, which this routine won’t have direct access to.

Once ported, your first region would look something like this:

#pragma acc data copy(...) // add data clauses
{
#pragma acc parallel loop gang
                for (ap=0; ap<aNgO; ap++){

                        cont = 0;

#pragma acc loop vector
                        for (am=0; am<ap; am++)
                                cont += negObs[am];

#pragma acc loop vector
                        for (an=0; an<aNgS; an++){

                                .. init reduction vars.

                                for (am=0; am<negObs[ap]; am++){
                                    ... reduction code
                                    ... note the dependency on cont
                                }
                                .. store reduction vars back to global array
                        }
                 }
}

Though, you might get better performance by precomputing the cont for each ap index. That way you can break the dependency between the loops and allow for more parallelization.

#pragma acc data copy(...) // add data clauses
{
#pragma acc parallel 
{
#pragma acc loop gang
                for (ap=0; ap<aNgO; ap++){

                        contarr[ap] = 0;
#pragma acc loop vector
                        for (am=0; am<ap; am++)
                                contarr[ap] += negObs[am];
                }
}
#pragma acc parallel 
{
#pragma acc loop gang collapse(2)
                for (ap=0; ap<aNgO; ap++){
                        for (an=0; an<aNgS; an++){
                                cont = contarr[ap];
                                .. init reduction vars.
#pragma acc loop vector
                                for (am=0; am<negObs[ap]; am++){
                                    ... reduction code
                                }
                                .. store reduction vars back to global array
                        }
                 }
}
} // end of the acc data region

This way you can parallelize both the ap and an loops in a 2-D gang (i.e. a CUDA block) and then vectorize the inner am loop. Before, you could only vectorize the first am loop and the an loop, and put the ap loop in a 1-D gang.
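A minimal sketch of the precompute step described above, assuming integer counts in negObs. The helper name precompute_cont and the local scalar sum with a reduction clause are illustrative choices, not taken from the original code. Compilers without OpenACC support simply ignore the pragmas, so the routine can be checked on the CPU first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: contarr[ap] receives the sum of negObs[0..ap-1],
   which is the value "cont" holds at the start of iteration ap in the
   original nested loop. */
void precompute_cont(const int *negObs, int *contarr, int aNgO)
{
    assert(negObs != NULL && contarr != NULL && aNgO >= 0);

    #pragma acc parallel loop gang copyin(negObs[0:aNgO]) copyout(contarr[0:aNgO])
    for (int ap = 0; ap < aNgO; ap++) {
        int sum = 0;  /* per-iteration scalar, combined across vector lanes */
        #pragma acc loop vector reduction(+:sum)
        for (int am = 0; am < ap; am++)
            sum += negObs[am];
        contarr[ap] = sum;
    }
}
```

With negObs = {3, 1, 4, 2} this fills contarr with {0, 3, 4, 8}, so the second compute region can read contarr[ap] without depending on earlier iterations.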

Granted, this probably doesn’t make much sense to you yet. But hopefully it will soon.

Mat

LET · July 27, 2012, 6:19pm

Thanks for the insight, so if I were to use just C syntax for the code and put everything in one main file, I would be able to use your suggestions, correct?

MatColgrove · July 27, 2012, 9:06pm

so if I were to use just C syntax for the code and put everything in one main file, I would be able to use your suggestions, correct?

If you rewrote the code in C, then yes, you would be able to. Though, it wouldn’t need to be all put into a main file.

You would need to modify the class methods into a set of C routines. The class data members may be a bit of a problem. Normally you’d wrap these up in a struct, but only fixed-size structs can currently be used with OpenACC since data must be contiguous. Granted, you only have one instance of your class, so you don’t really need to wrap the data into a struct.
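As a rough sketch of that approach, a class method can become a plain C routine that receives the former data members as arguments. The names scale_field, field, n and factor below are hypothetical and do not come from the project; the extern "C" guard is the usual way to let a C++ caller (compiled with pgcpp) link against a routine compiled as C (with pgcc).

```c
#include <assert.h>
#include <stddef.h>

/* Declaration as it would appear in a header shared by the C and C++ sides.
   The guard gives the routine C linkage when the header is read by a C++ compiler. */
#ifdef __cplusplus
extern "C" {
#endif
void scale_field(double *field, int n, double factor);
#ifdef __cplusplus
}
#endif

/* Hypothetical C routine: the array and scalars that used to be class
   data members are passed in explicitly. */
void scale_field(double *field, int n, double factor)
{
    assert(field != NULL && n >= 0);

    #pragma acc parallel loop copy(field[0:n])
    for (int i = 0; i < n; i++)
        field[i] *= factor;
}
```

On the C++ side the former method body would shrink to a single call that passes its members to this routine, so the class interface itself does not have to change.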

I’m sure there will be other issues as well so ask if you get stuck.

Mat

vacaloca · July 31, 2012, 9:13pm

I have no problem translating all the source to C, except for the part where I read the files – I actually need variable arrays since I am reading variable length files. I’m a bit stuck as to how to leave those methods intact and process them using pgcpp and do the rest using pgcc.

I tried an example that seemed promising, posted here by NMTop40: calling c++ function from c – however that did not seem to work with pgcc, either.

Could you provide some insight or perhaps a template into mixing C and C++ code under pgcc/pgcpp? I was hoping to use the VS C plugin, but the links I was sent do not work, so I’m stuck at the moment, unfortunately.

Edit: In the meantime I will just hard code the values to the sample data files until I see how I can accomplish the C/C++ mix… at least I can get some progress this way.

Edit 2: Seems like pgcc compiles matrices initialized with variable names, after all, so it might not be an issue, just VS2010 doesn’t want to have any of that when compiling a *.c file, which is probably a good thing… anyway, I finally got the VS C plugin… will port out code and see where I get stuck next, haha.

MatColgrove · August 1, 2012, 8:45pm

Hi Luis,

FYI, we have a whole chapter in our User’s Guide devoted to inter-language calling, including C to C++ and C++ to C, that you might find useful. (See Chapter 13 of PGI Documentation Archive for Versions Prior to 17.7)

anyway, I finally got the VS C plugin… will port out code and see where I get stuck next, haha.

Good. Let us know how it goes.

Mat

LET · August 3, 2012, 3:59am

ehehe… I should RTFM more often. :)

Anyway, I managed to port the code to C that compiles under pgcc and just stuck it all in one file just to make it easier for myself in the meantime. The issue is that when I tried the pragmas suggested I did not get the right outputs – in fact I ended up getting no output at all, results were still initialized to zero, or were orders of magnitude off. When the pragmas are not in place, I get the exact same output I get with compiling with MSVC 2008/2010, which is good, because I know the code is working.

I do want to mention however, that even without pragmas, the code executes in ~17 seconds for the same data set with a single CPU thread, vs 58 seconds for the MSVC version, but regardless of that, OpenMP runs it in ~9 secs.

I have sent the ported code to the support e-mail in case any suggestions can be made in regards to why the pragma additions are not producing the correct outputs. Perhaps there were some implied changes mentioned in the post that I did not implement?

I compiled as:

pgcc -Minline -ta=nvidia,cc20 -acc -Minfo file.c

Here is the output I got when I did the pragmas as suggested with -Minfo flag and I get the bogus results:

PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Mismatched loop levels when adding syncs (ifmm.c: 466)
main:
    368, time inlined, size=2, file ifmm.c (132)
    462, Generating present_or_copy(aEph[0:Nobs])
         Generating present_or_copy(aEth[0:Nobs])
         Generating present_or_copy(aCobsZ[0:NgO])
         Generating present_or_copy(aCobsY[0:NgO])
         Generating present_or_copy(aCobsX[0:NgO])
         Generating present_or_copy(acs32[0:Nobs])
         Generating present_or_copy(acs23[0:Nobs])
         Generating present_or_copy(acs22[0:Nobs])
         Generating present_or_copy(acs13[0:Nobs])
         Generating present_or_copy(acs12[0:Nobs])
         Generating present_or_copy(akuZ[0:NgONgS])
         Generating present_or_copy(akuY[0:NgONgS])
         Generating present_or_copy(akuX[0:NgONgS])
         Generating present_or_copy(arz[0:Nobs])
         Generating present_or_copy(ary[0:Nobs])
         Generating present_or_copy(arx[0:Nobs])
         Generating present_or_copy(aRadio[0:NgONgS])
         Generating present_or_copy(aAggregation_Jz[0:NgONgS])
         Generating present_or_copy(aAggregation_Jy[0:NgONgS])
         Generating present_or_copy(aAggregation_Jx[0:NgONgS])
         Generating present_or_copy(negObs[0:NgO])
         Generating present_or_copy(aNgS)
         Generating present_or_copy(aNgO)
         Generating present_or_copy(afactorJ)
         Generating present_or_copy(k0)
         Generating present_or_copy(coefEcartEphZ)
         Generating present_or_copy(coefEcartEthZ)
         Generating present_or_copy(coefEcartEphY)
         Generating present_or_copy(coefEcartEthY)
         Generating present_or_copy(coefEcartEphX)
         Generating present_or_copy(prodEscal)
         Generating present_or_copy(vvar4)
         Generating present_or_copy(vvar3)
         Generating present_or_copy(vvar2)
         Generating present_or_copy(vvar1)
         Generating present_or_copy(aux6)
         Generating present_or_copy(aux5)
         Generating present_or_copy(aux4)
         Generating present_or_copy(aux3)
         Generating present_or_copy(aux2)
         Generating present_or_copy(aux1)
         Generating present_or_copy(ap)
         Generating present_or_copy(an)
         Generating present_or_copy(am)
         Generating present_or_copy(cont)
         Generating present_or_copy(accumZ)
         Generating present_or_copy(accumY)
         Generating present_or_copy(accumX)
    466, Accelerator kernel generated
        468, #pragma acc loop gang /* blockIdx.x */
        476, #pragma acc loop vector(256) /* threadIdx.x */
        481, #pragma acc loop vector(256) /* threadIdx.x */
    468, Scalar last value needed after loop for 'prodEscal' at line 619
         Scalar last value needed after loop for 'prodEscal' at line 620
         Scalar last value needed after loop for 'prodEscal' at line 621
         Scalar last value needed after loop for 'aux3' at line 625
         Scalar last value needed after loop for 'aux3' at line 631
         Scalar last value needed after loop for 'aux3' at line 681
         Scalar last value needed after loop for 'aux6' at line 625
         Scalar last value needed after loop for 'aux6' at line 577
         Scalar last value needed after loop for 'aux6' at line 578
         Scalar last value needed after loop for 'aux2' at line 624
         Scalar last value needed after loop for 'aux2' at line 630
         Scalar last value needed after loop for 'aux2' at line 681
         Scalar last value needed after loop for 'aux5' at line 624
         Scalar last value needed after loop for 'aux5' at line 574
         Scalar last value needed after loop for 'aux5' at line 575
         Scalar last value needed after loop for 'aux1' at line 623
         Scalar last value needed after loop for 'aux1' at line 629
         Scalar last value needed after loop for 'aux1' at line 567
         Scalar last value needed after loop for 'aux1' at line 568
         Scalar last value needed after loop for 'aux1' at line 569
         Scalar last value needed after loop for 'aux1' at line 681
         Scalar last value needed after loop for 'aux4' at line 623
         Scalar last value needed after loop for 'aux4' at line 571
         Scalar last value needed after loop for 'aux4' at line 572
         Accelerator restriction: scalar variable live-out from loop: accumX
         Accelerator restriction: scalar variable live-out from loop: aux4
         Accelerator restriction: scalar variable live-out from loop: aux1
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphX
         Accelerator restriction: scalar variable live-out from loop: accumY
         Accelerator restriction: scalar variable live-out from loop: aux5
         Accelerator restriction: scalar variable live-out from loop: aux2
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphY
         Accelerator restriction: scalar variable live-out from loop: coefEcartEthY
         Accelerator restriction: scalar variable live-out from loop: accumZ
         Accelerator restriction: scalar variable live-out from loop: aux6
         Accelerator restriction: scalar variable live-out from loop: aux3
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphZ
         Accelerator restriction: scalar variable live-out from loop: coefEcartEthZ
         Accelerator restriction: scalar variable live-out from loop: prodEscal
         Accelerator restriction: scalar variable live-out from loop: vvar4
         Accelerator restriction: scalar variable live-out from loop: vvar3
         Accelerator restriction: scalar variable live-out from loop: vvar2
         Accelerator restriction: scalar variable live-out from loop: vvar1
         Accelerator restriction: scalar variable live-out from loop: am
         Accelerator restriction: scalar variable live-out from loop: cont
         Accelerator restriction: scalar variable live-out from loop: an
         Conditional loop will be executed in scalar mode
    476, Accelerator restriction: induction variable live-out from loop: ap
    477, Accelerator restriction: induction variable live-out from loop: am
         Accelerator restriction: induction variable live-out from loop: ap
    481, Complex loop carried dependence of '*(aAggregation_Jz).real' prevents parallelization
         Loop carried dependence of '*(aAggregation_Jz).real' prevents parallelization
         Complex loop carried dependence of '*(aAggregation_Jz).imag' prevents parallelization
         Complex loop carried dependence of '*(aAggregation_Jy).real' prevents parallelization
         Loop carried dependence of '*(aAggregation_Jy).real' prevents parallelization
         Complex loop carried dependence of '*(aAggregation_Jy).imag' prevents parallelization
         Complex loop carried dependence of '*(aAggregation_Jx).real' prevents parallelization
         Loop carried dependence of '*(aAggregation_Jx).real' prevents parallelization
         Complex loop carried dependence of '*(aAggregation_Jx).imag' prevents parallelization
         Complex loop carried dependence of '*(arx)' prevents parallelization
         Complex loop carried dependence of '*(aCobsX)' prevents parallelization
         Complex loop carried dependence of '*(ary)' prevents parallelization
         Complex loop carried dependence of '*(aCobsY)' prevents parallelization
         Complex loop carried dependence of '*(arz)' prevents parallelization
         Complex loop carried dependence of '*(aCobsZ)' prevents parallelization
         Complex loop carried dependence of '*(akuZ)' prevents parallelization
         Complex loop carried dependence of '*(akuY)' prevents parallelization
         Complex loop carried dependence of '*(akuX)' prevents parallelization
         Scalar last value needed after loop for 'prodEscal' at line 619
         Scalar last value needed after loop for 'prodEscal' at line 620
         Scalar last value needed after loop for 'prodEscal' at line 621
         Complex loop carried dependence of '*(acs32)' prevents parallelization
         Complex loop carried dependence of '*(acs22)' prevents parallelization
         Complex loop carried dependence of '*(acs12)' prevents parallelization
         Complex loop carried dependence of '*(acs23)' prevents parallelization
         Complex loop carried dependence of '*(acs13)' prevents parallelization
         Complex loop carried dependence of '*(aEph).real' prevents parallelization
         Complex loop carried dependence of '*(aEth).real' prevents parallelization
         Scalar last value needed after loop for 'aux3' at line 625
         Scalar last value needed after loop for 'aux3' at line 631
         Scalar last value needed after loop for 'aux3' at line 681
         Complex loop carried dependence of '*(aEph).imag' prevents parallelization
         Complex loop carried dependence of '*(aEth).imag' prevents parallelization
         Scalar last value needed after loop for 'aux6' at line 625
         Scalar last value needed after loop for 'aux6' at line 577
         Scalar last value needed after loop for 'aux6' at line 578
         Scalar last value needed after loop for 'aux2' at line 624
         Scalar last value needed after loop for 'aux2' at line 630
         Scalar last value needed after loop for 'aux2' at line 681
         Scalar last value needed after loop for 'aux5' at line 624
         Scalar last value needed after loop for 'aux5' at line 574
         Scalar last value needed after loop for 'aux5' at line 575
         Scalar last value needed after loop for 'aux1' at line 623
         Scalar last value needed after loop for 'aux1' at line 629
         Scalar last value needed after loop for 'aux1' at line 567
         Scalar last value needed after loop for 'aux1' at line 568
         Scalar last value needed after loop for 'aux1' at line 569
         Scalar last value needed after loop for 'aux1' at line 681
         Scalar last value needed after loop for 'aux4' at line 623
         Scalar last value needed after loop for 'aux4' at line 571
         Scalar last value needed after loop for 'aux4' at line 572
         Accelerator restriction: scalar variable live-out from loop: accumX
         Accelerator restriction: scalar variable live-out from loop: aux4
         Accelerator restriction: scalar variable live-out from loop: aux1
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphX
         Accelerator restriction: scalar variable live-out from loop: accumY
         Accelerator restriction: scalar variable live-out from loop: aux5
         Accelerator restriction: scalar variable live-out from loop: aux2
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphY
         Accelerator restriction: scalar variable live-out from loop: coefEcartEthY
         Accelerator restriction: scalar variable live-out from loop: accumZ
         Accelerator restriction: scalar variable live-out from loop: aux6
         Accelerator restriction: scalar variable live-out from loop: aux3
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphZ
         Accelerator restriction: scalar variable live-out from loop: coefEcartEthZ
         Accelerator restriction: scalar variable live-out from loop: prodEscal
         Accelerator restriction: scalar variable live-out from loop: vvar4
         Accelerator restriction: scalar variable live-out from loop: vvar3
         Accelerator restriction: scalar variable live-out from loop: vvar2
         Accelerator restriction: scalar variable live-out from loop: vvar1
         Accelerator restriction: scalar variable live-out from loop: am
    488, Accelerator restriction: induction variable live-out from loop: ap
         Scalar last value needed after loop for 'prodEscal' at line 619
         Scalar last value needed after loop for 'prodEscal' at line 620
         Scalar last value needed after loop for 'prodEscal' at line 621
         Scalar last value needed after loop for 'aux3' at line 625
         Scalar last value needed after loop for 'aux3' at line 631
         Scalar last value needed after loop for 'aux3' at line 681
         Scalar last value needed after loop for 'aux6' at line 625
         Scalar last value needed after loop for 'aux6' at line 577
         Scalar last value needed after loop for 'aux6' at line 578
         Scalar last value needed after loop for 'accumZ' at line 534
         Scalar last value needed after loop for 'accumZ' at line 535
         Scalar last value needed after loop for 'aux2' at line 624
         Scalar last value needed after loop for 'aux2' at line 630
         Scalar last value needed after loop for 'aux2' at line 681
         Scalar last value needed after loop for 'aux5' at line 624
         Scalar last value needed after loop for 'aux5' at line 574
         Scalar last value needed after loop for 'aux5' at line 575
         Scalar last value needed after loop for 'accumY' at line 537
         Scalar last value needed after loop for 'accumY' at line 538
         Scalar last value needed after loop for 'aux1' at line 623
         Scalar last value needed after loop for 'aux1' at line 629
         Scalar last value needed after loop for 'aux1' at line 567
         Scalar last value needed after loop for 'aux1' at line 568
         Scalar last value needed after loop for 'aux1' at line 569
         Scalar last value needed after loop for 'aux1' at line 681
         Scalar last value needed after loop for 'aux4' at line 623
         Scalar last value needed after loop for 'aux4' at line 571
         Scalar last value needed after loop for 'aux4' at line 572
         Scalar last value needed after loop for 'accumX' at line 540
         Scalar last value needed after loop for 'accumX' at line 541
         Accelerator restriction: scalar variable live-out from loop: accumX
         Accelerator restriction: scalar variable live-out from loop: aux4
         Accelerator restriction: scalar variable live-out from loop: aux1
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphX
         Accelerator restriction: scalar variable live-out from loop: accumY
         Accelerator restriction: scalar variable live-out from loop: aux5
         Accelerator restriction: scalar variable live-out from loop: aux2
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphY
         Accelerator restriction: scalar variable live-out from loop: coefEcartEthY
         Accelerator restriction: scalar variable live-out from loop: accumZ
         Accelerator restriction: scalar variable live-out from loop: aux6
         Accelerator restriction: scalar variable live-out from loop: aux3
         Accelerator restriction: scalar variable live-out from loop: coefEcartEphZ
         Accelerator restriction: scalar variable live-out from loop: coefEcartEthZ
         Accelerator restriction: scalar variable live-out from loop: prodEscal
         Accelerator restriction: scalar variable live-out from loop: vvar4
         Accelerator restriction: scalar variable live-out from loop: vvar3
         Accelerator restriction: scalar variable live-out from loop: vvar2
         Accelerator restriction: scalar variable live-out from loop: vvar1
    490, Accelerator restriction: induction variable live-out from loop: ap
         Accelerator restriction: induction variable live-out from loop: am
    491, Accelerator restriction: induction variable live-out from loop: ap
         Accelerator restriction: induction variable live-out from loop: am
    492, Accelerator restriction: induction variable live-out from loop: ap
         Accelerator restriction: induction variable live-out from loop: am
    494, Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
    500, Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
         Accelerator restriction: induction variable live-out from loop: am
    502, Accelerator restriction: induction variable live-out from loop: am
         Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
    504, Accelerator restriction: induction variable live-out from loop: am
    505, Accelerator restriction: induction variable live-out from loop: am
    506, c_prod inlined, size=6, file ifmm.c (694)
         511, Accelerator restriction: induction variable live-out from loop: an
              Accelerator restriction: induction variable live-out from loop: ap
              Accelerator restriction: induction variable live-out from loop: am
         513, Accelerator restriction: induction variable live-out from loop: an
              Accelerator restriction: induction variable live-out from loop: ap
              Accelerator restriction: induction variable live-out from loop: am
         515, Accelerator restriction: induction variable live-out from loop: am
         516, Accelerator restriction: induction variable live-out from loop: am
    517, c_prod inlined, size=6, file ifmm.c (694)
         522, Accelerator restriction: induction variable live-out from loop: an
              Accelerator restriction: induction variable live-out from loop: ap
              Accelerator restriction: induction variable live-out from loop: am
         524, Accelerator restriction: induction variable live-out from loop: an
              Accelerator restriction: induction variable live-out from loop: ap
              Accelerator restriction: induction variable live-out from loop: am
         526, Accelerator restriction: induction variable live-out from loop: am
         527, Accelerator restriction: induction variable live-out from loop: am
    528, c_prod inlined, size=6, file ifmm.c (694)
         532, Accelerator restriction: induction variable live-out from loop: am
              Accelerator restriction: induction variable live-out from loop: ap
    534, Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
    535, Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
    537, Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
    538, Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
    540, Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
    541, Accelerator restriction: induction variable live-out from loop: an
         Accelerator restriction: induction variable live-out from loop: ap
    543, Accelerator restriction: induction variable live-out from loop: an
    544, Accelerator restriction: induction variable live-out from loop: ap
    567, c_prod inlined, size=6, file ifmm.c (694)
    568, c_prod inlined, size=6, file ifmm.c (694)
    569, c_prod inlined, size=6, file ifmm.c (694)
    619, c_prod inlined, size=6, file ifmm.c (694)
    620, c_prod inlined, size=6, file ifmm.c (694)
    621, c_prod inlined, size=6, file ifmm.c (694)
    637, time inlined, size=2, file ifmm.c (132)
    653, difftime inlined, size=2, file ifmm.c (83)
    691, Accelerator restriction: induction variable live-out from loop: ap
PGC/x86-64 Windows 12.6-0: compilation completed with warnings

MatColgrove · August 3, 2012, 9:03pm

Hi Luis,

I’m looking at the code right now. The “scalar last value needed” messages appear because the addresses of these variables are being passed into the “c_prod” routine. The compiler must assume that the value could be stored in a variable needed after the end of the compute region. The workaround is to use the “private” clause to force the compiler to use a private copy.
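A small sketch of that workaround, under stated assumptions: the c_prod below is a stand-in with a guessed signature (the real one is not shown in this thread), multiply_all is a hypothetical caller, and the device build is assumed to inline c_prod, as the -Minline flag used earlier would allow.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double real, imag; } Complex;

/* Stand-in for the thread's c_prod: complex product written through a pointer. */
static void c_prod(const Complex *a, const Complex *b, Complex *out)
{
    out->real = a->real * b->real - a->imag * b->imag;
    out->imag = a->real * b->imag + a->imag * b->real;
}

/* Hypothetical caller: aux1 is scratch storage whose address is passed to
   c_prod, so it is listed in private() to give each iteration its own copy. */
void multiply_all(const Complex *x, const Complex *y, Complex *z, int n)
{
    Complex aux1;
    assert(x != NULL && y != NULL && z != NULL && n >= 0);

    #pragma acc parallel loop private(aux1) copyin(x[0:n], y[0:n]) copyout(z[0:n])
    for (int i = 0; i < n; i++) {
        c_prod(&x[i], &y[i], &aux1);
        z[i] = aux1;
    }
}
```

Declaring the scratch variable inside the loop body would have a similar effect, but the private clause keeps the change to the ported code small.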

The “Complex loop carried dependence” messages appear because the compiler can’t tell whether your array accesses are unique, since you use a look-up table to get the index. The compiler must assume that duplicate elements are used, and hence that the loop is not parallel. When using the kernels method, you can use the “loop independent” clause to have the compiler ignore this dependency by asserting that the iterations are independent. Using the parallel method, “independent” is implied and these warnings don’t affect kernel generation. However, if the look-up table does contain duplicate indices, you may get wrong answers.
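One way to guard against that last case is to check the look-up table on the host before asserting independence. The sketch below is illustrative only: indices_are_unique and scatter_add are hypothetical names, and the scatter loop is a simplified stand-in for the indexed stores in the real code.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if every idx[i] is a distinct value in [0, range), otherwise 0
   (also 0 if the scratch allocation fails). */
int indices_are_unique(const int *idx, int n, int range)
{
    assert(idx != NULL && n >= 0 && range >= 0);
    unsigned char *seen = calloc((size_t)range + 1, 1);
    if (seen == NULL)
        return 0;

    int unique = 1;
    for (int i = 0; i < n && unique; i++) {
        if (idx[i] < 0 || idx[i] >= range || seen[idx[i]])
            unique = 0;          /* out of range or already used */
        else
            seen[idx[i]] = 1;
    }
    free(seen);
    return unique;
}

/* Indexed update; "independent" is only a valid promise when idx has no duplicates. */
void scatter_add(const int *idx, const double *in, double *out, int n, int range)
{
    #pragma acc kernels copyin(idx[0:n], in[0:n]) copy(out[0:range])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; i++)
            out[idx[i]] += in[i];
    }
}
```

Calling indices_are_unique once after the table is built, and falling back to a sequential loop when it returns 0, avoids silently wrong results from duplicate indices.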

Mat