Accessing the code generated by OpenMP target pragmas

Hi,
I am working on a project using the NVIDIA HPC compilers with OpenMP GPU offloading. The OpenMP kernels are unexpectedly slow.

We have modified the code by adding only the __device__ and __global__ keywords so that CUDA can be used as a backend, and the CUDA kernels run faster than the OpenMP target regions. My task now is to understand why the generated OpenMP kernels are slower. For this I want to analyze the generated code, since compiling with -Minfo only gives me diagnostics such as the following:

     34, #omp target
         34, Generating "nvkernel_main_F1L34_2" GPU kernel
     47, FMA (fused multiply-add) instruction(s) generated

How can I access the generated CUDA code? From what I understand, CUDA code is generated as a result of the OpenMP pragmas. I tried searching the documentation but could not figure out how to do this.

Thanks in advance!

How can I access the generated CUDA code?

While early implementations of OpenACC did get translated to CUDA, we now use an LLVM-based device code generator for both OpenACC and OpenMP offload. We do still keep the old CUDA code generator around for debugging purposes, but it is not supported, not maintained, and not guaranteed to work correctly.

To access it, use the flag “-gpu=keep,nonvvm” and a “.n001.gpu” file will be generated which you can review. Note that the code produced is very low-level CUDA and may not be easy to read.

For the generated LLVM code, use “-gpu=keep”. Again the “.gpu” file will contain the output.
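For example, a compile line along these lines (just a sketch; the source file name is a placeholder and the cc86 target should be adjusted to match your GPU):

nvc++ -mp=gpu -gpu=cc86,keep,nonvvm myfile.cpp   # "myfile.cpp" is a placeholder source file

The “.n001.gpu” (or “.gpu”) file should then appear alongside the other intermediate files.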

-Mat

Thank you for the answer, I will try it.

A follow-up question before I try your suggestion. I have written two simple programs implementing vector addition: one calls a CUDA kernel in a .cu file, and the other is a .cpp file using OpenMP pragmas.

I am initializing the memory on the GPU with calls to the CUDA runtime in both versions (using is_device_ptr() in the omp pragma) and measuring the time. The native CUDA code is significantly faster, even though the kernels themselves require minimal time. I intend to compare the generated .ptx files of both implementations to understand why the OpenMP code is significantly slower.
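(For that comparison I am thinking of something like the following, assuming the PTX is actually embedded in the binaries; cuobjdump ships with the CUDA toolkit, and the executable names are only examples:)

cuobjdump -ptx ./cuda   # executable names are just examples
cuobjdump -ptx ./omp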

The relevant sections are the following.
For CUDA:

    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            y[i] = (a * x[i]) + y[i];
        }
    }
    ...
    saxpy<<<4096, 128>>>(N, r, d_x, d_y);
    cudaDeviceSynchronize();

and for OpenMP:

#pragma omp target teams distribute parallel for schedule(static, 1) is_device_ptr(d_x, d_y) num_teams(4096) num_threads(128)
  for (size_t i = 0; i < N; i++)
  {
    d_y[i] = (r * d_x[i]) + d_y[i];
  }

Do you have any suggestions on why OpenMP would be slower?

Do you have any suggestions on why OpenMP would be slower?

Not offhand, but if you can provide a complete reproducing example, I can take a look.

Though, I’d suggest you profile the code with Nsight Systems to determine where the time difference is coming from. Is it the kernel or something else?
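For example, something like this should give a timeline plus summary statistics (assuming the Nsight Systems command line tool, nsys, is installed; the executable name is a placeholder):

nsys profile --stats=true ./a.out   # replace ./a.out with your executable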

You may also consider trying the “loop” construct instead of “distribute”. With “distribute”, we need to outline the code and leave the final optimization decisions up to the runtime. With “loop”, the compiler can make better decisions, often leading to more performant code. Though with “loop”, you are more constrained in what can be used in the compute region.

For example:

#pragma omp target teams loop is_device_ptr(d_x, d_y)
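Applied to your saxpy loop, the full region would look roughly like this (just a sketch; the schedule, num_teams, and num_threads clauses are dropped so the compiler and runtime can choose the launch configuration):

#pragma omp target teams loop is_device_ptr(d_x, d_y)
  for (size_t i = 0; i < N; i++)
  {
    d_y[i] = (r * d_x[i]) + d_y[i];
  }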

Thank you for the advice. Yes, I can give you both programs and the commands I use to compile them.

OpenMP version: (called omp_kernel.cpp)

#include <stdio.h>
#include <math.h>
#include <algorithm> // std::fill
#include <chrono>
#include <iostream>

#include "omp.h"
#include <cuda_runtime.h>

// https://leimao.github.io/blog/Proper-CUDA-Error-Checking/
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
template <typename T>
void check(T err, const char *const func, const char *const file,
           const int line)
{
  if (err != cudaSuccess)
  {
    std::cerr << "CUDA Runtime Error at: " << file << ":" << line
              << std::endl;
    std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
    std::exit(EXIT_FAILURE);
  }
}

int main(void)
{
  constexpr size_t N = 1000000; // 10**6
  constexpr float r = 2.0;
  constexpr float yval = 5.0;
  constexpr float xval = 10.0;

  /* Initialization of vectors */
  std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
  float *x = new float[N];
  float *y = new float[N];
  std::fill(x, x + N, xval);
  std::fill(y, y + N, yval);
  float *d_x;
  float *d_y;
  CHECK_CUDA_ERROR(cudaMalloc((void **)&d_x, sizeof(float) * N));
  CHECK_CUDA_ERROR(cudaMalloc((void **)&d_y, sizeof(float) * N));
  CHECK_CUDA_ERROR(cudaMemcpy((void *)d_x, (void *)x, sizeof(float) * N, cudaMemcpyHostToDevice));
  CHECK_CUDA_ERROR(cudaMemcpy((void *)d_y, (void *)y, sizeof(float) * N, cudaMemcpyHostToDevice));
  std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
  std::cout << "Cuda mempcy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;

  begin = std::chrono::steady_clock::now();
  /* Dot product of two vectors */
  omp_set_num_teams(4096);
  omp_set_num_threads(512);
#pragma omp target teams distribute parallel for schedule(static, 1) is_device_ptr(d_x, d_y) num_teams(4096) num_threads(128)
  for (size_t i = 0; i < N; i++)
  {
    d_y[i] = (r * d_x[i]) + d_y[i];
  }
  end = std::chrono::steady_clock::now();
  std::cout << "Saxpy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;

  CHECK_CUDA_ERROR(cudaMemcpy((void *)y, (void *)d_y, sizeof(float) * N, cudaMemcpyDeviceToHost));
  double sum = 0.0;
  for (size_t i = 0; i < N; i++)
  {
    sum += y[i];
  }
  double expected = (((r * xval) * N) + yval * N);
  printf("The sum is: %2.2f (expected: %2.2f)\n", sum, expected);

  return 0;
}

CUDA version: (called cuda_kernel.cu)

#include <algorithm>
#include <chrono>
#include <iostream>
#include <stdio.h>
#include <cuda.h>

// https://leimao.github.io/blog/Proper-CUDA-Error-Checking/
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
template <typename T>
void check(T err, const char *const func, const char *const file,
           const int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

__global__ void saxpy(int n, float a, float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = (a * x[i]) + y[i];
    }
}

int main()
{
    constexpr size_t N = 1000000; // 10**6
    constexpr float r = 2.0;
    constexpr float yval = 5.0;
    constexpr float xval = 10.0;

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    float *x = new float[N];
    float *y = new float[N];
    std::fill(x, x + N, xval);
    std::fill(y, y + N, yval);
    float *d_x;
    float *d_y;
    CHECK_CUDA_ERROR(cudaMalloc((void **)&d_x, sizeof(float) * N));
    CHECK_CUDA_ERROR(cudaMalloc((void **)&d_y, sizeof(float) * N));
    CHECK_CUDA_ERROR(cudaMemcpy((void *)d_x, (void *)x, sizeof(float) * N, cudaMemcpyHostToDevice));
    CHECK_CUDA_ERROR(cudaMemcpy((void *)d_y, (void *)y, sizeof(float) * N, cudaMemcpyHostToDevice));
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Cuda mempcy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;

    begin = std::chrono::steady_clock::now();
    saxpy<<<4096, 128>>>(N, r, d_x, d_y);
    cudaDeviceSynchronize();
    end = std::chrono::steady_clock::now();
    std::cout << "Saxpy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;

    CHECK_CUDA_ERROR(cudaMemcpy((void *)y, (void *)d_y, sizeof(float) * N, cudaMemcpyDeviceToHost));
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
    {
        sum += y[i];
    }
    double expected = (((r * xval) * N) + yval * N);
    printf("The sum is: %2.2f (expected: %2.2f)\n", sum, expected);
}

This is how I compile them:

nvc++ -cuda -mp=gpu -Minfo -gpu=cc86 -o omp omp_kernel.cpp

nvc++ -cuda -mp=gpu -Minfo -gpu=cc86 -o cuda cuda_kernel.cu

So far I have only tested the runtime on my local laptop with an NVIDIA GeForce RTX 3050.

One note: when I try to increase constexpr size_t N from 10^6 (1000000) to, for example, 10^9, I get a segmentation fault, so I have left N at 10^6 so far.

I was so focused on getting the CUDA representation that I forgot to profile the code. I will try “loop” and profiling next, thank you.

Looking at the profiling information, the generated kernel itself takes roughly the same time in both the CUDA and OpenMP variants. But in the OpenMP variant there are calls to cuInit, cuMemAlloc, etc. that cause additional overhead. I guess this overhead can’t be avoided, am I right?

It looks to me like some initialization overhead in the OpenMP runtime that only occurs the first time a kernel is launched. I updated the code by putting the OpenMP target region inside a loop so it’s called 5 times; calls after the first match the CUDA time.

Example:

% nvc++ -mp=gpu -fast omp_kernel.cpp -cuda -o omp.out ; omp.out
Cuda memcpy: 2076529 [μs]
Saxpy: 4210 [μs]
Saxpy: 19 [μs]
Saxpy: 18 [μs]
Saxpy: 17 [μs]
Saxpy: 17 [μs]
The sum is: 105000000.00 (expected: 25000000.00)
% nvcc -O3 -arch=sm_80 cuda_kernel.cu -o cuda.out; ./cuda.out
Cuda memcpy: 2253538 [μs]
Saxpy: 41 [μs]
Saxpy: 16 [μs]
Saxpy: 16 [μs]
Saxpy: 15 [μs]
Saxpy: 15 [μs]
The sum is: 105000000.00 (expected: 25000000.00)
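For reference, the modification was essentially just wrapping the timed target region in a repetition loop, roughly like this (a sketch; the loop count of 5 matches the output above):

  for (int rep = 0; rep < 5; rep++)
  {
    begin = std::chrono::steady_clock::now();
#pragma omp target teams distribute parallel for schedule(static, 1) is_device_ptr(d_x, d_y) num_teams(4096) num_threads(128)
    for (size_t i = 0; i < N; i++)
    {
      d_y[i] = (r * d_x[i]) + d_y[i];
    }
    end = std::chrono::steady_clock::now();
    std::cout << "Saxpy: " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " [μs]" << std::endl;
  }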

Thanks for the reply. I had also suspected that from analyzing the Nsight Systems output. Thanks for your help!
