Hello,
I have a work buffer in my OpenMP program that I want to split up between teams. I allocate the scratch memory up front, and then inside my #pragma omp target teams region I use omp_get_team_num() to point each team at its own section of the working memory. I’m observing that omp_get_team_num() returns 0 for all teams, which makes it impossible to partition the working memory by team.
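Roughly, the pattern in the real program looks like this (NUM_TEAMS, CHUNK, and the function name are placeholders, not my actual code; the host allocation is visible on the device because I build with -gpu=managed, per the script below):

#include <stdlib.h>
#include <omp.h>

#define NUM_TEAMS 2     // placeholder values, not my real sizes
#define CHUNK     4096

void partition_scratch(void)
{
    // One shared scratch allocation, sliced per team
    float *scratch = (float *) malloc(NUM_TEAMS * CHUNK * sizeof(float));

    #pragma omp target teams num_teams(NUM_TEAMS)
    {
        // Intended: team t works in scratch[t*CHUNK .. (t+1)*CHUNK - 1].
        // In practice omp_get_team_num() is 0 everywhere, so every team
        // ends up pointing at the same slice.
        float *slice = scratch + omp_get_team_num() * CHUNK;
        slice[0] = (float) omp_get_team_num();  // stand-in for the real work
    }

    free(scratch);
}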
I created a simple test program to illustrate the issue. It seems that any time I use #pragma omp loop inside the #pragma omp target teams region, omp_get_team_num() returns 0:
test.cpp:
#include <stdlib.h>
#include <omp.h>
#include <stdio.h>

#define NUM_COLS 8192

int main(int argc, char **argv)
{
    // calloc so the inputs are initialized (their values don't matter here)
    float *v = (float *) calloc(NUM_COLS, sizeof(float));
    float *a = (float *) calloc(NUM_COLS, sizeof(float));
    float *b = (float *) calloc(NUM_COLS, sizeof(float));

    #pragma omp target teams num_teams(2)
    {
        //#pragma omp loop // Enabling this pragma causes omp_get_team_num() to always return 0
        #pragma omp for    // This pragma does not affect omp_get_team_num()
        for (int i = 0; i < NUM_COLS; i++)
        {
            v[i] = b[i] * a[i];
        }

        // Thread 0 of each team reports its team number
        if (omp_get_thread_num() == 0)
        {
            printf("THREAD: %i/%i, TEAM: %i/%i\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   omp_get_team_num(), omp_get_num_teams());
        }
    }

    free(v);
    free(a);
    free(b);
    return 0;
}
build.sh:
#!/bin/bash
NVCPP=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/bin/nvc++
NVCPP_ARGS="-mp=gpu -gpu=managed -Minfo=mp"
# Swap in the line below to build for the multicore CPU instead of the GPU.
#NVCPP_ARGS="-mp=multicore -Minfo=mp"
COMPILER=$NVCPP
ARGS=$NVCPP_ARGS
$COMPILER test.cpp -o test.exe $ARGS
When I use #pragma omp for, I see the results I expect, with each team getting a distinct team index of 0 or 1:
./build.sh && ./test.exe
main:
14, #omp target teams num_teams(2)
14, Generating "nvkernel_main_F1L14_2" GPU kernel
17, Loop parallelized across threads(128), schedule(static)
20, Barrier
THREAD: 0/1, TEAM: 1/2
THREAD: 0/1, TEAM: 0/2
When I use #pragma omp loop instead, both teams report that there is only 1 team, and both identify themselves as team #0.
./build.sh && ./test.exe
main:
14, #omp target teams loop num_teams(2)
14, Generating "nvkernel_main_F1L14_2" GPU kernel
Generating NVIDIA GPU code
17, Loop parallelized across teams(2), threads(128) /* blockIdx.x threadIdx.x */
14, Generating Multicore code
17, Loop parallelized across threads
14, Generating implicit map(tofrom:v[:8192])
Generating implicit map(to:b[:8192],a[:8192])
THREAD: 0/1, TEAM: 0/1
THREAD: 0/1, TEAM: 0/1
I want to use #pragma omp loop in my real program because I want to be able to switch between CPU and GPU execution on the fly (a sketch of the pattern I'm after is below). Am I doing something wrong, or is this a bug?
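For reference, the shape I'm after looks roughly like this, where use_gpu would be a runtime flag of my own (my understanding is that the if clause makes the region fall back to host execution when the expression is false):

// use_gpu is a hypothetical runtime flag, not part of my test case
#pragma omp target teams loop if(target: use_gpu)
for (int i = 0; i < NUM_COLS; i++)
{
    v[i] = b[i] * a[i];
}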
Thanks,
Matt