nvc++ generates a 1-dimensional grid with exactly as many blocks as SMs, causing low performance

I am trying to offload a CPU Winograd convolution that was originally implemented in OpenMP.
The compiler is nvhpc@23.9.
Here is the code (a naive batched GEMM, as an example):


void sgemm_batched(const int64_t batch_size,
                   const int64_t M,
                   const int64_t N,
                   const int64_t K,
                   float *A,
                   float *B,
                   float *C) {
  typedef float(*A_tensor_t)[M][K];
  typedef float(*B_tensor_t)[N][K];
  typedef float(*C_tensor_t)[N][M];
  A_tensor_t A_tensor = (A_tensor_t)A;
  B_tensor_t B_tensor = (B_tensor_t)B;
  C_tensor_t C_tensor = (C_tensor_t)C;

#pragma omp target
#pragma omp parallel for collapse(3)
#pragma acc parallel loop collapse(3)
  for (int64_t batch = 0; batch < batch_size; ++batch) {
    for (int64_t m = 0; m < M; ++m) {
      for (int64_t n = 0; n < N; ++n) {
        C_tensor[batch][n][m] = 0;
        for (int64_t k = 0; k < K; ++k) {
          C_tensor[batch][n][m] += A_tensor[batch][m][k] * B_tensor[batch][n][k];
        }
      }
    }
  }
}
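For completeness, here is a small host-side reference I use to sanity-check results (the helper names `sgemm_batched_ref` and `buffers_match` are mine, not part of the original code). It computes the same C[batch][n][m] layout with flat indexing, so it compiles as plain C without any offload flags:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Naive host reference: C[b][n][m] = sum_k A[b][m][k] * B[b][n][k],
   using flat indexing instead of the VLA-typedef trick. */
static void sgemm_batched_ref(int64_t batch_size, int64_t M, int64_t N,
                              int64_t K, const float *A, const float *B,
                              float *C) {
  for (int64_t b = 0; b < batch_size; ++b)
    for (int64_t n = 0; n < N; ++n)
      for (int64_t m = 0; m < M; ++m) {
        float acc = 0.0f;
        for (int64_t k = 0; k < K; ++k)
          acc += A[(b * M + m) * K + k] * B[(b * N + n) * K + k];
        C[(b * N + n) * M + m] = acc;
      }
}

/* Returns 1 if the two buffers match element-wise within tol. */
static int buffers_match(const float *x, const float *y, int64_t n,
                         float tol) {
  for (int64_t i = 0; i < n; ++i)
    if (fabsf(x[i] - y[i]) > tol) return 0;
  return 1;
}
```

Running the offloaded version and this reference on the same random inputs and comparing with `buffers_match` is a quick way to confirm the pragmas did not change the semantics.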

The Makefile (I noticed the similarity between OpenACC and OpenMP, so I kept both sets of directives together):

OMP_FLAG = -O3 -g -Wall -mp=gpu,multicore -gpu=cc89,managed -std=c++17 -Minfo=mp
ACC_FLAG = -O1 -g -Wall -acc=gpu,multicore -gpu=cc89,managed,time -std=c++17 -Minfo=acc 
# Also, -O3 makes the compiler crash during optimization (on both nvhpc 23.9 and 24.3), so I disabled it
# message: /.../.../24.3/compilers/share/llvm/bin/opt: /tmp/nvc++3hqDdLpcQBm_0.ll:809:23: error: use of undefined value '%_T2_6083.addr'
#                   %178 = load i64, ptr %_T2_6083.addr, align 8, !tbaa !925, !dbg !997

CXX=nvc++

omp:
	${CXX} driver.cc winograd.cc ${OMP_FLAG} -o winograd

acc:
	${CXX} driver.cc winograd.cc ${ACC_FLAG} -o winograd

The GPU is an L40; here is the output of nvaccelinfo:


CUDA Driver Version:           12060
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  560.35.03  Fri Aug 16 21:39:15 UTC 2024

Device Number:                 0
Device Name:                   NVIDIA L40
Device Revision Number:        8.9
Global Memory Size:            47929425920
Number of Multiprocessors:     142
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    2490 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             9001 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 100663296 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
Default Target:                cc89

The build command is: make omp

The problem: I found it quite slow, so I profiled it with nsys. The grid size of every kernel is (142,1,1), exactly the number of SMs on the L40 (142).

That causes low occupancy, as shown in this section of the ncu output:

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           24
    Block Limit Registers                 block           10
    Block Limit Shared Mem                block           24
    Block Limit Warps                     block           12
    Theoretical Active Warps per SM        warp           40
    Theoretical Occupancy                     %        83.33
    Achieved Occupancy                        %         8.32
    Achieved Active Warps Per SM           warp         3.99
    ------------------------------- ----------- ------------
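As a quick sanity check on those numbers (my own arithmetic, not ncu output): nvaccelinfo reports 1536 max threads per SM, i.e. 48 warps, so 3.99 achieved active warps is 3.99/48 ≈ 8.3%, and the theoretical 40 warps gives 40/48 ≈ 83.3%, matching the table:

```c
#include <assert.h>
#include <math.h>

/* Occupancy = active warps / max warps per SM, where
   max warps = max threads per SM / warp size (32). */
static double occupancy_pct(double active_warps, int max_threads_per_sm) {
  return 100.0 * active_warps / (max_threads_per_sm / 32.0);
}
```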

How can I fix this? Thanks for any responses!

My best guess is that it's due to the missing "teams distribute".

Try

#pragma omp target teams distribute parallel for collapse(3)

or use the “loop” construct:

#pragma omp target teams loop collapse(3)
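My understanding of why this helps (hedged, based on the observed (142,1,1) grids): "teams distribute" splits the collapsed iteration space across teams (CUDA thread blocks) first, then "parallel for" splits each team's chunk across its threads, so the grid scales with the problem size instead of staying at one block per SM. A minimal sketch on a vector add, with illustrative (not tuned) num_teams/thread_limit clauses:

```c
#define VN 64

/* "teams distribute" maps outer chunks to teams (the grid);
   "parallel for" maps each chunk to threads within a team (the block).
   The clause values below are illustrative only. */
void vadd(const float *a, const float *b, float *c) {
#pragma omp target teams distribute parallel for num_teams(256) thread_limit(128)
  for (int i = 0; i < VN; ++i)
    c[i] = a[i] + b[i];
}
```

Without offload flags the pragma is ignored and the loop simply runs serially on the host, so the same source stays portable.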


Thank you, it works for me.


void sgemm_batched(const int64_t batch_size,
                   const int64_t M,
                   const int64_t N,
                   const int64_t K,
                   float *A,
                   float *B,
                   float *C) {
  typedef float(*A_tensor_t)[M][K];
  typedef float(*B_tensor_t)[N][K];
  typedef float(*C_tensor_t)[N][M];
  A_tensor_t A_tensor = (A_tensor_t)A;
  B_tensor_t B_tensor = (B_tensor_t)B;
  C_tensor_t C_tensor = (C_tensor_t)C;

#pragma omp target teams loop collapse(3)
#pragma acc parallel loop collapse(3)
  for (int64_t batch = 0; batch < batch_size; ++batch) {
    for (int64_t m = 0; m < M; ++m) {
      for (int64_t n = 0; n < N; ++n) {
        C_tensor[batch][n][m] = 0;
        for (int64_t k = 0; k < K; ++k) {
          C_tensor[batch][n][m] += A_tensor[batch][m][k] * B_tensor[batch][n][k];
        }
      }
    }
  }
}