Accessing the code generated by OpenMP target pragmas

Hi,
I am working on a project using the NVIDIA HPC compilers with OpenMP GPU offloading. The OpenMP kernels are unexpectedly slow.

We have modified the code by adding only the __device__ and __global__ keywords so that CUDA can be used as a backend, and the CUDA kernels run faster than the OpenMP target regions. My task now is to understand why the generated OpenMP kernels are slower. For this I want to analyze the generated code, since compiling with -Minfo only gives me diagnostics such as the following:

     34, #omp target
         34, Generating "nvkernel_main_F1L34_2" GPU kernel
     47, FMA (fused multiply-add) instruction(s) generated

How can I access the generated CUDA code? From what I understand, CUDA code is generated as a result of the OpenMP pragmas. I tried searching the documentation but could not figure out how to do this.

Thanks in advance!

How can I access the generated CUDA code?

While early implementations of OpenACC did get translated to CUDA, we now use an LLVM-based device code generator for both OpenACC and OpenMP offload. We do still keep the old CUDA code generator around for debugging purposes, but it is not supported, not maintained, and not guaranteed to work correctly.

To access it, use the flag “-gpu=keep,nonvvm” and a “.n001.gpu” file will be generated which you can review. Note that the code produced is very low-level CUDA and may not be easy to read.

For the generated LLVM code, use “-gpu=keep”. Again the “.gpu” file will contain the output.
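For example, a compile line along these lines (just a sketch; the source file name is a placeholder and the cc86 target should be adjusted to match your GPU):

nvc++ -mp=gpu -gpu=cc86,keep,nonvvm myfile.cpp   # "myfile.cpp" is a placeholder source file

The “.n001.gpu” (or “.gpu”) file should then appear alongside the other intermediate files.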

-Mat

Thank you for the answer, I will try it.

A follow-up question before I try your suggestion. I have written two simple programs implementing vector addition: one calls a CUDA kernel in a .cu file, and the other is a .cpp file using OpenMP pragmas.

I am initializing the memory on the GPU with calls to the CUDA runtime in both versions (using is_device_ptr() in the omp pragma) and measuring the time. The native CUDA code is significantly faster, even though the kernels themselves require minimal time. I intend to compare the generated .ptx files of both implementations to understand why the OpenMP code is significantly slower.
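(For that comparison I am thinking of something like the following, assuming the PTX is actually embedded in the binaries; cuobjdump ships with the CUDA toolkit, and the executable names are only examples:)

cuobjdump -ptx ./cuda   # executable names are just examples
cuobjdump -ptx ./omp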

The relevant sections are the following.
For CUDA:

    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            y[i] = (a * x[i]) + y[i];
        }
    }
    ...
    saxpy<<<4096, 128>>>(N, r, d_x, d_y);
    cudaDeviceSynchronize();

and for OpenMP:

#pragma omp target teams distribute parallel for schedule(static, 1) is_device_ptr(d_x, d_y) num_teams(4096) num_threads(128)
  for (size_t i = 0; i < N; i++)
  {
    d_y[i] = (r * d_x[i]) + d_y[i];
  }

Do you have any suggestions on why OpenMP would be slower?

Do you have any suggestions on why OpenMP would be slower?

Not offhand, but if you can provide a complete reproducing example, I can take a look.

Though, I’d suggest you profile the code with Nsight Systems to determine where the time difference is coming from. Is it the kernel or something else?
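For example, something like this should give a timeline plus summary statistics (assuming the Nsight Systems command line tool, nsys, is installed; the executable name is a placeholder):

nsys profile --stats=true ./a.out   # replace ./a.out with your executable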

You may also consider trying the “loop” construct instead of “distribute”. With “distribute”, we need to outline the code and leave the final optimization decisions up to the runtime. With “loop”, the compiler can make better decisions, often leading to more performant code. Though with “loop”, you are more constrained in what can be used in the compute region.

For example:

#pragma omp target teams loop is_device_ptr(d_x, d_y)
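Applied to your saxpy loop, the full region would look roughly like this (just a sketch; the schedule, num_teams, and num_threads clauses are dropped so the compiler and runtime can choose the launch configuration):

#pragma omp target teams loop is_device_ptr(d_x, d_y)
  for (size_t i = 0; i < N; i++)
  {
    d_y[i] = (r * d_x[i]) + d_y[i];
  }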

Thank you for the advice. Yes, I can give you both programs and the commands I use to compile them.

OpenMP version: (called omp_kernel.cpp)

#include <stdio.h>
#include <math.h>
#include <algorithm> // std::fill
#include <chrono>
#include <iostream>

#include "omp.h"
#include <cuda_runtime.h>

// https://leimao.github.io/blog/Proper-CUDA-Error-Checking/
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
template <typename T>
void check(T err, const char *const func, const char *const file,
           const int line)
{
  if (err != cudaSuccess)
  {
    std::cerr << "CUDA Runtime Error at: " << file << ":" << line
              << std::endl;
    std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
    std::exit(EXIT_FAILURE);
  }
}

int main(void)
{
  constexpr size_t N = 1000000; // 10**6
  constexpr float r = 2.0;
  constexpr float yval = 5.0;
  constexpr float xval = 10.0;

  /* Initialization of vectors */
  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
  float *x = new float[N];
  float *y = new float[N];
  std::fill(x, x + N, xval);
  std::fill(y, y + N, yval);
  float *d_x;
  float *d_y;
  CHECK_CUDA_ERROR(cudaMalloc((void **)&d_x, sizeof(float) * N));
  CHECK_CUDA_ERROR(cudaMalloc((void **)&d_y, sizeof(float) * N));
  CHECK_CUDA_ERROR(cudaMemcpy((void *)d_x, (void *)x, sizeof(float) * N, cudaMemcpyHostToDevice));
  CHECK_CUDA_ERROR(cudaMemcpy((void *)d_y, (void *)y, sizeof(float) * N, cudaMemcpyHostToDevice));
  std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
  std::cout << "Cuda mempcy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;

  begin = std::chrono::steady_clock::now();
  /* Dot product of two vectors */
  omp_set_num_teams(4096);
  omp_set_num_threads(512);
#pragma omp target teams distribute parallel for schedule(static, 1) is_device_ptr(d_x, d_y) num_teams(4096) num_threads(128)
  for (size_t i = 0; i < N; i++)
  {
    d_y[i] = (r * d_x[i]) + d_y[i];
  }
  end = std::chrono::steady_clock::now();
  std::cout << "Saxpy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;

  CHECK_CUDA_ERROR(cudaMemcpy((void *)y, (void *)d_y, sizeof(float) * N, cudaMemcpyDeviceToHost));
  double sum = 0.0;
  for (size_t i = 0; i < N; i++)
  {
    sum += y[i];
  }
  double expected = (((r * xval) * N) + yval * N);
  printf("The sum is: %2.2f (expected: %2.2f)\n", sum, expected);

  return 0;
}

CUDA version: (called cuda_kernel.cu)

#include <algorithm>
#include <chrono>
#include <iostream>
#include <stdio.h>
#include <cuda.h>

// https://leimao.github.io/blog/Proper-CUDA-Error-Checking/
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
template <typename T>
void check(T err, const char *const func, const char *const file,
           const int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

__global__ void saxpy(int n, float a, float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = (a * x[i]) + y[i];
    }
}

int main()
{
    constexpr size_t N = 1000000; // 10**6
    constexpr float r = 2.0;
    constexpr float yval = 5.0;
    constexpr float xval = 10.0;

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    float *x = new float[N];
    float *y = new float[N];
    std::fill(x, x + N, xval);
    std::fill(y, y + N, yval);
    float *d_x;
    float *d_y;
    CHECK_CUDA_ERROR(cudaMalloc((void **)&d_x, sizeof(float) * N));
    CHECK_CUDA_ERROR(cudaMalloc((void **)&d_y, sizeof(float) * N));
    CHECK_CUDA_ERROR(cudaMemcpy((void *)d_x, (void *)x, sizeof(float) * N, cudaMemcpyHostToDevice));
    CHECK_CUDA_ERROR(cudaMemcpy((void *)d_y, (void *)y, sizeof(float) * N, cudaMemcpyHostToDevice));
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Cuda mempcy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;

    begin = std::chrono::steady_clock::now();
    saxpy<<<4096, 128>>>(N, r, d_x, d_y);
    cudaDeviceSynchronize();
    end = std::chrono::steady_clock::now();
    std::cout << "Saxpy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;

    CHECK_CUDA_ERROR(cudaMemcpy((void *)y, (void *)d_y, sizeof(float) * N, cudaMemcpyDeviceToHost));
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
    {
        sum += y[i];
    }
    double expected = (((r * xval) * N) + yval * N);
    printf("The sum is: %2.2f (expected: %2.2f)\n", sum, expected);
}

This is how I compile them:

nvc++ -cuda -mp=gpu -Minfo -gpu=cc86 -o omp omp_kernel.cpp

nvc++ -cuda -mp=gpu -Minfo -gpu=cc86 -o cuda cuda_kernel.cu

So far I have only tested the runtime on my local laptop with an NVIDIA GeForce RTX 3050.

One note: when I try to increase constexpr size_t N from 10^6 (1000000) to, for example, 10^9, I get a segmentation fault, so I have left N at 10^6 so far.

I was so focused on getting the CUDA representation that I forgot to profile the code. I will try “loop” and profiling next, thank you.

Looking at the profiling information, the generated kernel itself takes roughly the same time in both the CUDA and OpenMP variants. But in the OpenMP variant there are calls to cuInit, cuMemAlloc, etc. that cause additional overhead. I guess this overhead can’t be avoided, am I right?

It looks to me like some initialization overhead in the OpenMP runtime that only occurs the first time a kernel is launched. I updated the code by putting the OpenMP target region inside a loop so it’s called 5 times; calls after the first match the CUDA time.

Example:

% nvc++ -mp=gpu -fast omp_kernel.cpp -cuda -o omp.out ; omp.out
Cuda memcpy: 2076529 [μs]
Saxpy: 4210 [μs]
Saxpy: 19 [μs]
Saxpy: 18 [μs]
Saxpy: 17 [μs]
Saxpy: 17 [μs]
The sum is: 105000000.00 (expected: 25000000.00)
% nvcc -O3 -arch=sm_80 cuda_kernel.cu -o cuda.out; ./cuda.out
Cuda memcpy: 2253538 [μs]
Saxpy: 41 [μs]
Saxpy: 16 [μs]
Saxpy: 16 [μs]
Saxpy: 15 [μs]
Saxpy: 15 [μs]
The sum is: 105000000.00 (expected: 25000000.00)
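For reference, the modification was essentially just wrapping the timed target region in a repetition loop, roughly like this (a sketch; the loop count of 5 matches the output above):

  for (int rep = 0; rep < 5; rep++)
  {
    begin = std::chrono::steady_clock::now();
#pragma omp target teams distribute parallel for schedule(static, 1) is_device_ptr(d_x, d_y) num_teams(4096) num_threads(128)
    for (size_t i = 0; i < N; i++)
    {
      d_y[i] = (r * d_x[i]) + d_y[i];
    }
    end = std::chrono::steady_clock::now();
    std::cout << "Saxpy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;
  }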

Thanks for the reply. I had also suspected that from analyzing the Nsight Systems output. Thanks for your help!
