OpenACC example running slower on the GPU

I am going through an example from gpubootcamp. The code runs extremely slowly: it takes longer to execute on the GPU than it does serially. What am I doing wrong? The code I am running is below.

I am compiling with the following command: nvc -acc -gpu=managed -Minfo=accel

I am using an NVIDIA GeForce RTX 3050 Laptop GPU, but I am running the code in a container.

#include <math.h>
#include <stdlib.h>
#include <string.h>

#define OFFSET(x, y, m) (((x)*(m)) + (y))

// Zero both grids and set the boundary row to 1.0.
void initialize(double *restrict A, double *restrict Anew, int m, int n)
{
    memset(A, 0, n * m * sizeof(double));
    memset(Anew, 0, n * m * sizeof(double));

    for (int i = 0; i < m; i++) {
        A[i]    = 1.0;
        Anew[i] = 1.0;
    }
}

// One Jacobi sweep: average the four neighbors into Anew and
// return the maximum change relative to A.
double calcNext(double *restrict A, double *restrict Anew, int m, int n)
{
    double error = 0.0;
    #pragma acc parallel loop reduction(max:error)
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[OFFSET(j, i, m)] = 0.25 * ( A[OFFSET(j, i+1, m)] + A[OFFSET(j, i-1, m)]
                                           + A[OFFSET(j-1, i, m)] + A[OFFSET(j+1, i, m)]);
            error = fmax(error, fabs(Anew[OFFSET(j, i, m)] - A[OFFSET(j, i, m)]));
        }
    }
    return error;
}

// Copy the interior of Anew back into A for the next iteration.
void swap(double *restrict A, double *restrict Anew, int m, int n)
{
    #pragma acc parallel loop
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[OFFSET(j, i, m)] = Anew[OFFSET(j, i, m)];
        }
    }
}

void deallocate(double *restrict A, double *restrict Anew)
{
    free(A);
    free(Anew);
}
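
For reference, laplace2d.c is driven by the lab's jacobi.c, which I haven't pasted. It is essentially a convergence loop that calls calcNext and swap until the error drops below a tolerance or an iteration cap is hit. A rough sketch is below (reconstructed only for context: the tolerance, iteration cap, and timing code are assumptions and may differ from the actual lab source):

/* Rough sketch of the jacobi.c driver (not the exact lab source). */
#include <stdio.h>
#include <stdlib.h>

double calcNext(double *restrict A, double *restrict Anew, int m, int n);
void   swap(double *restrict A, double *restrict Anew, int m, int n);
void   initialize(double *restrict A, double *restrict Anew, int m, int n);
void   deallocate(double *restrict A, double *restrict Anew);

int main(void)
{
    const int    n = 4096, m = 4096;   /* mesh size, as in the output below */
    const int    iter_max = 1000;      /* assumed iteration cap */
    const double tol = 1.0e-6;         /* assumed convergence tolerance */

    double *A    = (double*)malloc(sizeof(double) * n * m);
    double *Anew = (double*)malloc(sizeof(double) * n * m);

    initialize(A, Anew, m, n);
    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);

    double error = 1.0;
    int    iter  = 0;
    while (error > tol && iter < iter_max) {
        error = calcNext(A, Anew, m, n);   /* one sweep, offloaded via OpenACC */
        swap(A, Anew, m, n);               /* copy Anew back into A */
        if (iter % 100 == 0)
            printf("%5d, %0.6f\n", iter, error);
        iter++;
    }
    /* the real lab source also times this loop and prints the total */

    deallocate(A, Anew);
    return 0;
}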

Hi Island_don,

I grabbed the bootcamp source from GitHub (openhackathons-org/gpubootcamp) to make sure I'm using the same code, and I assume you're looking at lab 1's laplace2d example. I then applied your changes and do see a significant speed-up, so I think the code is running as expected. Let's try to diagnose why you're not seeing the same.

To confirm you’re generating OpenACC code, please post the full output from the compilation including the compiler feedback messages. Here’s what mine looks like:

lab1% nvc -fast -o laplace jacobi.c laplace2d.c -acc -Minfo=accel -gpu=managed
jacobi.c:
laplace2d.c:
calcNext:
     47, Generating NVIDIA GPU code
         49, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(max:error)
         51, #pragma acc loop vector(128) /* threadIdx.x */
     47, Generating implicit copyin(A[:]) [if not already present]
         Generating implicit copy(error) [if not already present]
         Generating implicit copyout(Anew[:]) [if not already present]
     51, Loop is parallelizable
swap:
     62, Generating NVIDIA GPU code
         64, #pragma acc loop gang /* blockIdx.x */
         66, #pragma acc loop vector(128) /* threadIdx.x */
     62, Generating implicit copyin(Anew[:]) [if not already present]
         Generating implicit copyout(A[:]) [if not already present]
     66, Loop is parallelizable

Second, let's make sure it's running on the GPU by setting the environment variable “NV_ACC_TIME=1”. This environment variable enables the OpenACC runtime's profile output.
For example:

% setenv NV_ACC_TIME 1
% ./laplace
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 0.551199 s

Accelerator Kernel Timing data
lab1/laplace2d.c
  calcNext  NVIDIA  devicenum=0
    time(us): 230,040
    47: compute region reached 1000 times
        47: kernel launched 1000 times
            grid: [4094]  block: [128]
             device time(us): total=203,011 max=24,840 min=170 avg=203
            elapsed time(us): total=225,786 max=24,894 min=192 avg=225
        47: reduction kernel launched 1000 times
            grid: [1]  block: [256]
             device time(us): total=11,995 max=20 min=11 avg=11
            elapsed time(us): total=34,800 max=50 min=32 avg=34
    47: data region reached 2000 times
        47: data copyin transfers: 1000
             device time(us): total=5,684 max=8 min=5 avg=5
        57: data copyout transfers: 1000
             device time(us): total=9,350 max=39 min=8 avg=9
lab1/laplace2d.c
  swap  NVIDIA  devicenum=0
    time(us): 176,992
    62: compute region reached 1000 times
        62: kernel launched 1000 times
            grid: [4094]  block: [128]
             device time(us): total=176,992 max=184 min=172 avg=176
            elapsed time(us): total=201,796 max=227 min=196 avg=201
    62: data region reached 2000 times

My best guess is that the GPU isn't getting recognized in the container and hence the code isn't running on the GPU, but this info will confirm it. If you don't see any profile information, run “nvaccelinfo” or “nvidia-smi” from within the container to see if the GPU is found.
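
If it helps, here's a minimal standalone check using the OpenACC runtime API that you can compile and run from inside the container (just a sketch; the file name “devcheck.c” is only an example):

/* devcheck.c - ask the OpenACC runtime how many NVIDIA devices it can see.
   Build and run with: nvc -acc devcheck.c -o devcheck && ./devcheck */
#include <stdio.h>
#include <openacc.h>

int main(void)
{
    int ndev = acc_get_num_devices(acc_device_nvidia);
    printf("NVIDIA devices visible to the OpenACC runtime: %d\n", ndev);
    return 0;
}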

-Mat

Here is the output from the compilation step:

# nvc -fast -o laplace jacobi.c laplace2d.c -acc -Minfo=accel -gpu=managed
jacobi.c:
laplace2d.c:
calcNext:
     47, Generating NVIDIA GPU code
         49, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(max:error)
         51, #pragma acc loop vector(128) /* threadIdx.x */
     47, Generating implicit copyin(A[:]) [if not already present]
         Generating implicit copy(error) [if not already present]
         Generating implicit copyout(Anew[:]) [if not already present]
     51, Loop is parallelizable
swap:
     62, Generating NVIDIA GPU code
         64, #pragma acc loop gang /* blockIdx.x */
         66, #pragma acc loop vector(128) /* threadIdx.x */
     62, Generating implicit copyin(Anew[:]) [if not already present]
         Generating implicit copyout(A[:]) [if not already present]
     66, Loop is parallelizable
#

Here is the output after setting the environment variable and running:

# export NV_ACC_TIME=1
# ./laplace
libcupti.so not found
Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 99.970177 s

Accelerator Kernel Timing data
/labs/C/source_code/lab1/laplace2d.c
  calcNext  NVIDIA  devicenum=0
    time(us): 105,215
    47: compute region reached 1000 times
        47: kernel launched 1000 times
            grid: [4094]  block: [128]
            elapsed time(us): total=69,995,634 max=83,005 min=65,778 avg=69,995
        47: reduction kernel launched 1000 times
            grid: [1]  block: [256]
            elapsed time(us): total=1,668,182 max=9,789 min=1,424 avg=1,668
    47: data region reached 2000 times
        47: data copyin transfers: 1000
             device time(us): total=28,652 max=309 min=2 avg=28
        57: data copyout transfers: 1000
             device time(us): total=76,563 max=1,312 min=30 avg=76
/labs/C/source_code/lab1/laplace2d.c
  swap  NVIDIA  devicenum=0
    time(us): 0
    62: compute region reached 1000 times
        62: kernel launched 1000 times
            grid: [4094]  block: [128]
            elapsed time(us): total=27,876,954 max=39,571 min=27,542 avg=27,876
    62: data region reached 2000 times
#

Wow, that is bad. It is running on the device, though, so I was incorrect before. When I run the same code it only takes 0.5 seconds versus your 100 seconds, so I suspect there's something going on with the device itself.

Is the device being used for something else, like driving a display?
Is it a virtualized device, such that you only have a small share of it?

What’s the output from running “nvidia-smi”?

Here is the output from “nvidia-smi”

I should be clear that I am using a Windows 11 laptop with Ubuntu running via the Windows Subsystem for Linux (WSL). It is from this Linux environment that I am launching the container and running the code.

# nvidia-smi
Mon Jun 13 15:46:31 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.64       Driver Version: 471.80       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   37C    P3    12W /  N/A |    103MiB /  4096MiB |    ERR!      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Could it be that something was not installed correctly?

The GPU shows that it's in an error state (i.e. the “ERR!” in the output). I have no idea why, though, and this is outside of my area, so I'm not sure how to diagnose it.

You might try updating your CUDA driver, since according to the WSL User's Guide it needs to be at least 495, while your driver version is 471.80.

Following up here.

I am not sure why my GPU was in an error state, but I realized I had multiple CUDA drivers installed. I was able to get the GPU out of error mode after uninstalling everything but the essentials. However, I was still seeing poor performance, so I took the laptop back to the computer store and exchanged it for a different new laptop.

Now that I have a new and different laptop, the GPU is not in an error state, and I have learned that there were two things that still needed to be resolved. The first was that I needed to install the WSL version of the NVIDIA toolkit. The second was that the NVIDIA Container Toolkit needed to be installed as well.

This guide was very helpful and explained some of the reasons why my setup was not working.
Guide: WSL2 configuration for GPU support - Anchormen | Data activators.

After all that, the gpubootcamp example was still performing worse than the CPU, so I am still trying to figure out what is going on.

One interesting fact is that I am now able to run this nbody benchmark:

docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark

The results look good when I run this. However, the OpenACC code from the gpubootcamp is still not running faster than the CPU. When I compare the Dockerfile here, nvidia-docker/Dockerfile.ubuntu at main · NVIDIA/nvidia-docker · GitHub (which is probably similar to the one used to build the image used in the nbody example), to the Dockerfile used in the setup for the OpenACC example here, gpubootcamp/Dockerfile at master · openhackathons-org/gpubootcamp · GitHub, the two Dockerfile definitions look very different.

My guess is that the image used for the gpubootcamp setup no longer works, or does not work on a Windows system running Linux through WSL. I will try to modify the Dockerfile to make it more compatible with the one from the NVIDIA repo.