Can the PGI compiler (HPC SDK) be installed on the NVIDIA Orin?

I now have a few NVIDIA Orins and want to do GPU software development with OpenACC, but I don’t know whether the HPC SDK can be installed on the Orin.

We don’t officially support using the NVHPC SDK on Orin systems, but we have had reports from a few users who installed and used it successfully.

Give it a try and I can try to help if you encounter any issues.

-Mat

Thank you very much. I have now successfully installed the HPC SDK, but when I compiled an OpenACC program and tried to run it, I got a small error:

Failing in Thread:1
call to cuModuleLoadDataEx returned error 209: No binary for GPU

I think it’s due to a missing NVIDIA driver, but I don’t know how to fix it. Could you please give me some suggestions?

I found a post on how to solve a related problem.
According to that post:
Ok, so I don’t have a Nano here, nor do we officially support the Nano with the HPC compilers, but you might be able to get this to work by setting the environment variable “CUDA_HOME” to the system’s CUDA installation that matches the driver version, and then adding the flag “-gpu=cc50” to the compile line so the compiler will target a Maxwell device.

That does sound like one of my posts, but a bit dated.

The 209 error means that the generated binary does not include a binary built for the GPU on this system. By default, the compiler targets the GPU on the system where the binary is compiled; to determine this, it queries the CUDA driver. If the driver (libcuda.so) can’t be found, you need to specify the target device’s compute capability via the “-gpu=ccXX” flag.

I believe the current Orin systems have Ampere-based GPUs, so the correct flag would be “-gpu=cc80” or “-gpu=cc86”.
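For example, the compile line would look something like this (acc_c2.c here is just the sample file discussed later in this thread):

nvc -acc -gpu=cc86 -Minfo=accel -o acc_c2 acc_c2.c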

You can check which device is installed by running “nvaccelinfo”. If it says that it can’t find any devices, it’s the same underlying issue, i.e. it can’t find libcuda.so. To fix this, set the environment variable “LD_LIBRARY_PATH” to include the directory in which libcuda.so is located.
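As a rough sketch (the tegra path below is a common location on Jetson/JetPack installs, but not guaranteed; the find command shows where your libcuda.so actually lives):

find /usr -name 'libcuda.so*' 2>/dev/null
export LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/tegra:$LD_LIBRARY_PATH
nvaccelinfo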

Hope this helps,
Mat

I’m sorry, I only just saw this. I recently tried your method again and was able to compile with -gpu=cc86, but when I tried to run it, it still said:

Failing in Thread:1
call to cuModuleLoadDataEx returned error 209: No binary for GPU

Although I successfully compiled and ran the HPC SDK sample acc_c2.c with -gpu=cc50, there seems to be no GPU parallelization; the CPU and GPU running times are about the same.

Although I successfully compiled and ran the HPC SDK sample acc_c2.c with -gpu=cc50,

Perhaps you have an older Orin? The cc86 recommendation was based on the current generation of Orins.

What GPU is installed? (if you don’t know, run either nvidia-smi or nvaccelinfo).

there seems to be no GPU parallelization; the CPU and GPU running times are about the same

What do the compiler feedback messages say? (i.e. add the flag -Minfo=accel).

Try setting the environment variable “NV_ACC_TIME=1” to get a basic profile to see if it’s offloading to the device.
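For instance, something like the following (again using the sample file name as a stand-in):

nvc -o acc_c2 -acc -gpu=cc86 -Minfo=accel acc_c2.c
NV_ACC_TIME=1 ./acc_c2

The profile printed at program exit lists kernel launches, kernel elapsed times, and data transfer times, so you can see whether the compute region actually ran on the device.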

-Mat

Maybe I can give you some more information.
When I run nvaccelinfo, the output looks like this:

orin@orin-desktop:~/Desktop/code$ nvaccelinfo

CUDA Driver Version:           11040
NVRM version:                  NVIDIA UNIX Open Kernel Module for aarch64  35.1.0  Release Build  (buildbrain@mobile-u64-5273-d7000)  Wed Aug 10 20:32:39 PDT 2022

Device Number:                 0
Device Name:                   Orin
Device Revision Number:        8.7
Global Memory Size:            32019161088
Number of Multiprocessors:     16
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1300 MHz
Execution Timeout:             No
Integrated Device:             Yes
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             1300 MHz
Memory Bus Width:              128 bits
L2 Cache Size:                 4194304 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     No
Preemption Supported:          Yes
Cooperative Launch:            Yes
  Multi-Device:                Yes
Default Target:                cc86

When I compile my code, it looks like this:

orin@orin-desktop:~/Desktop/code$ nvc -o acc_c2 -acc -gpu=cc50 -Minfo=accel  acc_c2.c
main:
     60, Generating copyout(r[:n]) [if not already present]
         Generating create(s) [if not already present]
         Generating copyin(a[:n]) [if not already present]
         Generating create(c) [if not already present]
         Loop is parallelizable
         Generating Tesla code
         60, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

When I execute my code, it looks like this:

orin@orin-desktop:~/Desktop/code$ ./acc_c2
   1000000000 iterations completed
     35033149 microseconds on GPU
     34697369 microseconds on host
Test PASSED

The code to run is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
#include <openacc.h>
#include <accelmath.h>

#if defined(_WIN32) || defined(_WIN64)
#include <sys/timeb.h>
#define gettime(a) _ftime(a)
#define usec(t1,t2) ((((t2).time-(t1).time)*1000+((t2).millitm-(t1).millitm))*1000)
typedef struct _timeb timestruct;
#else
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#endif

int main( int argc, char* argv[] )
{
    int n;      /* size of the vector */
    float *a;  /* the vector */
    float *restrict r;  /* the results */
    float *e;  /* expected results */
    float s, c;
    timestruct t1, t2, t3;
    long long cgpu, chost;
    int i, nerrors;
    nerrors = 0;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 1000000000;
    if( n <= 0 ) n = 1000000000;

    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));
    for( i = 0; i < n; ++i ) a[i] = (float)(i+1) * 2.0f;
    /*acc_init( acc_device_nvidia );*/

    gettime( &t1 );
    
    #pragma acc kernels copyin(a[0:n])\
                create(s,c)\
                copyout(r[0:n])
    #pragma acc loop
    for( i = 0; i < n; ++i ){
        s = sinf(a[i]);
        c = cosf(a[i]);
        r[i] = s*s + c*c;
    }

    gettime( &t2 );
    cgpu = usec(t1,t2);
    for( i = 0; i < n; ++i ){
        s = sinf(a[i]);
        c = cosf(a[i]);
        e[i] = s*s + c*c;
    }
    gettime( &t3 );
    chost = usec(t2,t3);
    /* check the results */
    for( i = 0; i < n; ++i ) {
        if ( fabsf(r[i] - e[i]) >= 0.000001f ) {
           nerrors++;
        }
    }

    printf( "%13d iterations completed\n", n );
    printf( "%13ld microseconds on GPU\n", cgpu );
    printf( "%13ld microseconds on host\n", chost );
    if ( nerrors != 0 ) {
        printf( "Test FAILED\n");
    } else {
        printf( "Test PASSED\n");
    }
    return 0;
}

Ok, I got access to an Orin system. As your nvaccelinfo output shows, it’s a cc87 device, not cc86 as I originally thought, which is why you got the “No binary for GPU” error message. Using “-gpu=cc87”, or leaving the flag off and letting the compiler detect the device, should work around the error.

Like you, I’m not seeing good times from the GPU with this code, but the output from setting NV_ACC_TIME=1 shows the kernel time is quite good. The problem for me was that the device initialization overhead was being included in the timing, since initialization is done the first time an OpenACC construct is encountered.
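In the posted source this corresponds to the commented-out acc_init call just before the first gettime; moving device initialization ahead of the timers looks roughly like this (line numbers in my copy differ from the posted file):

    acc_init( acc_device_nvidia );  /* initialize the device outside the timed region */

    gettime( &t1 );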

Hence I uncommented line 52, “acc_init( acc_device_nvidia );”, so the device is initialized before the timers. This improved things quite a bit. Here’s my output:

% nvc -o acc_c2 -acc -Minfo=accel acc_c2.c -gpu=cc87
main:
     56, Loop is parallelizable
         Generating implicit private(s,c)
         Generating NVIDIA GPU code
         56, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     56, Generating implicit copyout(r[:n]) [if not already present]
         Generating implicit copyin(a[:n]) [if not already present]
% ./acc_c2
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES,
         at ../../src-cupti/prof_cuda_cupti.c:338.
      1000000 iterations completed
         5961 microseconds on GPU
        23246 microseconds on host
Test PASSED

Accelerator Kernel Timing data
/local/home/mcolgrove/acc_c2.c
  main  NVIDIA  devicenum=0
    time(us): 2,143
    56: compute region reached 1 time
        56: kernel launched 1 time
            grid: [7813]  block: [128]
            elapsed time(us): total=1,043 max=1,043 min=1,043 avg=1,043
    56: data region reached 2 times
        56: data copyin transfers: 1
             device time(us): total=513 max=513 min=513 avg=513
        60: data copyout transfers: 1
             device time(us): total=1,630 max=1,630 min=1,630 avg=1,630

I tried to use -gpu=cc87, but the nvc compiler tells me that cc87 is an unknown keyword. I think my HPC SDK may be outdated: I’m using HPC SDK 21.9. I don’t know which version you are using.
On the other hand, via

vim ~/.bashrc

I set

export NV_ACC_TIME=1

as an environment variable, but my output doesn’t seem to give the kernel time.

I’m using 23.1. You’ll want to keep your NVHPC version current in order to use newer hardware.

Please download our latest version at: https://developer.nvidia.com/hpc-sdk-downloads

Thank you very much for your help. I have another question now: HPC SDK 23.1 supports CUDA 11.8, but the CUDA version on my Orin is 11.4. Does this mean I need to upgrade my CUDA to 11.8?

It’s always a good idea to keep your CUDA Driver current, so I’d recommend you update, though it shouldn’t be necessary.

The NVHPC “multi” package includes CUDA 11.0, which you can fall back to.
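If I recall the option correctly, you can select the bundled CUDA version at compile time with the cuda sub-option of -gpu, along these lines (treat this as a sketch rather than a verified command for your setup):

nvc -acc -gpu=cc87,cuda11.0 -Minfo=accel -o acc_c2 acc_c2.c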

Thank you very much. I can now see the acceleration from OpenACC, and it is very significant. I made a basic mistake by not considering the HPC SDK version. Thank you for your help over the past two weeks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.