I am trying to offload a CPU winograd convolution implemented formerly in OpenMP.
The compiler is nvhpc@23.9.
Here is the code ( a naive batched gem for example)
void sgemm_batched(const int64_t batch_size,
const int64_t M,
const int64_t N,
const int64_t K,
float *A,
float *B,
float *C) {
typedef float(*A_tensor_t)[M][K];
typedef float(*B_tensor_t)[N][K];
typedef float(*C_tensor_t)[N][M];
A_tensor_t A_tensor = (A_tensor_t)A;
B_tensor_t B_tensor = (B_tensor_t)B;
C_tensor_t C_tensor = (C_tensor_t)C;
#pragma omp target
#pragma omp parallel for collapse(3)
#pragma acc parallel loop collapse(3)
for (int64_t batch = 0; batch < batch_size; ++batch) {
for (int64_t m = 0; m < M; ++m) {
for (int64_t n = 0; n < N; ++n) {
C_tensor[batch][n][m] = 0;
for (int64_t k = 0; k < K; ++k) {
C_tensor[batch][n][m] += A_tensor[batch][m][k] * B_tensor[batch][n][k];
}
}
}
}
}
The makefile (I am notice on the similarity between OpenAcc and OpenMP , so I put their directives together.):
OMP_FLAG = -O3 -g -Wall -mp=gpu,multicore -gpu=cc89,managed -std=c++17 -Minfo=mp
ACC_FLAG = -O1 -g -Wall -acc=gpu,multicore -gpu=cc89,managed,time -std=c++17 -Minfo=acc
# Also , -O3 causes panic during optimization(appears on both nvhpc 23.9 and 24.3), I disabled it
# message: /.../.../24.3/compilers/share/llvm/bin/opt: /tmp/nvc++3hqDdLpcQBm_0.ll:809:23: error: use of undefined value '%_T2_6083.addr'
# %178 = load i64, ptr %_T2_6083.addr, align 8, !tbaa !925, !dbg !997
CXX=nvc++
omp:
${CXX} driver.cc winograd.cc ${OMP_FLAG} -o winograd
acc:
${CXX} driver.cc winograd.cc ${ACC_FLAG} -o winograd
The GPU is a L40, here is a output of nvaccelinfo
CUDA Driver Version: 12060
NVRM version: NVIDIA UNIX x86_64 Kernel Module 560.35.03 Fri Aug 16 21:39:15 UTC 2024
Device Number: 0
Device Name: NVIDIA L40
Device Revision Number: 8.9
Global Memory Size: 47929425920
Number of Multiprocessors: 142
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 2490 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 9001 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 100663296 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Default Target: cc89
The build command : make omp.
The problem: I found it quite slow, so i use nsys profile. The grid sizes of all kernels are (142,1,1), exactly , the same as the number of SMs of L40 (142).
That caused a low occupancy, shown in the section of ncu output below:
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 24
Block Limit Registers block 10
Block Limit Shared Mem block 24
Block Limit Warps block 12
Theoretical Active Warps per SM warp 40
Theoretical Occupancy % 83.33
Achieved Occupancy % 8.32
Achieved Active Warps Per SM warp 3.99
------------------------------- ----------- ------------
How can I fix this? Thanks for responses!

