omp_get_team_num() returns 0 for all teams with nvc++

Hello,

I have a work buffer in my OpenMP program that I want to split up between teams. I have allocated the scratch memory, and then when I get into my #pragma omp target teams section, I use omp_get_team_num() to have each team point to its section of the working memory. I’m observing that omp_get_team_num() is returning 0 for all teams, which makes it impossible for me to split up the working memory by team.
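
For reference, the pattern I’m after looks roughly like this (simplified; SCRATCH_PER_TEAM is just a made-up slice size):

#include <omp.h>
#include <stdlib.h>

#define SCRATCH_PER_TEAM 1024   // made-up per-team slice size

int main()
{
    const int num_teams = 2;
    // One big scratch buffer, carved into one slice per team.
    float *scratch = (float *) malloc(num_teams * SCRATCH_PER_TEAM * sizeof(float));

    #pragma omp target teams num_teams(num_teams)
    {
        // Each team is supposed to index its own slice, but omp_get_team_num()
        // comes back 0 for every team once an "omp loop" is used in the region.
        float *my_scratch = scratch + omp_get_team_num() * SCRATCH_PER_TEAM;
        // ... per-team work using my_scratch ...
    }

    free(scratch);
}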

I created a simple test program to illustrate the issue. It seems that any time I use #pragma omp loop inside of the #pragma omp target teams section, that’s what causes omp_get_team_num() to return 0:

test.cpp:

#include <stdlib.h>
#include <omp.h>
#include <stdio.h>

#define NUM_COLS 8192

int main(int argc, char **argv)
{
    float *v = (float *) malloc(NUM_COLS * sizeof(float));
    float *a = (float *) malloc(NUM_COLS * sizeof(float));
    float *b = (float *) malloc(NUM_COLS * sizeof(float));

    #pragma omp target teams num_teams(2)
    {
       //#pragma omp loop // Enabling this pragma causes omp_get_team_num() to always == 0
       #pragma omp for // This pragma does not affect omp_get_team_num()
       for (int i = 0; i < NUM_COLS; i++)
       {
           v[i] = b[i] * a[i];
       }

       if (omp_get_thread_num() == 0)
       {
           printf("THREAD: %i/%i, TEAM: %i/%i\n", omp_get_thread_num(), omp_get_num_threads(), omp_get_team_num(), omp_get_num_teams());
       }
    }
}

build.sh:

#!/bin/bash

NVCPP=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/compilers/bin/nvc++
NVCPP_ARGS="-mp=gpu -gpu=managed -Minfo=mp"
#NVCPP_ARGS="-mp=multicore -Minfo=mp"

COMPILER=$NVCPP
ARGS=$NVCPP_ARGS

$COMPILER test.cpp -o test.exe $ARGS

I see the results I expect when I use #pragma omp for, with each team getting a team index of 0 or 1:

./build.sh && ./test.exe
main:
     14, #omp target teams num_teams(2)
         14, Generating "nvkernel_main_F1L14_2" GPU kernel
         17, Loop parallelized across threads(128), schedule(static)
         20, Barrier

THREAD: 0/1, TEAM: 1/2
THREAD: 0/1, TEAM: 0/2

When I use #pragma omp loop, both teams print that there is only 1 team and both print that they are team #0.

./build.sh && ./test.exe
main:
     14, #omp target teams loop num_teams(2)
         14, Generating "nvkernel_main_F1L14_2" GPU kernel
             Generating NVIDIA GPU code
           17, Loop parallelized across teams(2), threads(128) /* blockIdx.x threadIdx.x */
         14, Generating Multicore code
           17, Loop parallelized across threads
     14, Generating implicit map(tofrom:v[:8192])
         Generating implicit map(to:b[:8192],a[:8192])

THREAD: 0/1, TEAM: 0/1
THREAD: 0/1, TEAM: 0/1

I want to use #pragma omp loop in my real program because I want to be able to switch between CPU and GPU execution on the fly. Am I doing something wrong, or is this a bug?

Thanks,
Matt

If it helps at all, here’s the output from my main program:

 6, include "fstream"
          38, include "istream"
               38, include "ios"
                    40, include "char_traits.h"
                         39, include "stl_algobase.h"
                              62, include "type_traits.h"
                                  113, #omp target teams loop num_teams(2)
                                      113, Generating "nvkernel__Z17apply_sv_qpe_origPK12type_complexPKfjjjPS__F72L113_2" GPU kernel
                                           Generating NVIDIA GPU code
                                        137, Loop parallelized across teams(2) /* blockIdx.x */
                                        145, Loop parallelized across threads(128) /* threadIdx.x */
                                        152, Loop run sequentially
                                        163, Loop run sequentially
                                        177, Loop run sequentially
                                        187, Loop run sequentially
                                        197, Loop parallelized across threads(128) /* threadIdx.x */
                                             Generating implicit reduction(+:a1,a0)
                                      113, Generating Multicore code
                                        137, Loop parallelized across threads

<DELETED SOME OUTPUT HERE TO HIDE VARIABLE NAMES REFERRED TO IN IMPLICIT STATEMENTS>

THREAD: 0/1, TEAM: 0/1
THREAD: 0/1, TEAM: 0/1

The main code uses #pragma omp loop on all parallelized loops inside a #pragma omp target teams region, just as the test program does, except that there are many more loops since it operates on a matrix instead of a vector. The printf statement is identical to the one in test.cpp and is placed immediately after the initial #pragma omp target teams statement.

Thanks,
Matt

Hi Matt,

I’m checking with our OpenMP team as to what the expected behavior is w.r.t. calling omp_get_[thread|team]_num in a loop context. I seem to recall some restrictions, but don’t remember the details.

I’ll let you know once I hear back.

-Mat

Looks to be a compiler error. We’ve filed a problem report, TPR #33700.

Thanks!
Mat


Thanks, Mat. Will be watching for the fix.

So, my assumption is that for now, per-team working/scratch buffers are not possible with #omp loop in the #omp target teams section. With CPU execution, I’d be able to have each thread do a malloc() before it ran its processing code in the OpenMP section, but with GPUs, I’m assuming that would be illegal (even with Unified Memory enabled via compiler flag) since the compiler is trying to translate that section to kernel code. Is that right? Is there another option in the meantime for me to have per-team scratch space when targeting GPUs?

Thanks,
Matt

Correct. I’m not a compiler engineer, so I don’t presume to understand the details, but the triage of the problem was that we’re using the same method in both contexts to call these routines, while “loop” needs a different method.

With CPU execution, I’d be able to have each thread do a malloc() before it ran its processing code in the OpenMP section, but with GPUs, I’m assuming that would be illegal (even with Unified Memory enabled via compiler flag) since the compiler is trying to translate that section to kernel code. Is that right?

It’s not illegal, just not recommended. You can call malloc from the device, but performance can be poor since the calls get serialized, and because the default device heap size is small, it’s easy to hit a heap overflow unless you increase the heap size, either by calling cudaDeviceSetLimit or via the environment variable “NV_ACC_CUDA_HEAPSIZE”.
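
To make that concrete, here’s a rough sketch of raising the device heap before launching a kernel that calls malloc on the device. This assumes the CUDA runtime header is available to nvc++ (e.g. compiling with -cuda in addition to -mp=gpu), that -gpu=managed is not used, and the 64 MB heap size is just an example value:

#include <cuda_runtime.h>
#include <omp.h>
#include <stdlib.h>

int main()
{
    // Raise the device malloc heap to 64 MB before any kernel launches.
    // (Equivalently, set NV_ACC_CUDA_HEAPSIZE in the environment.)
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64ull * 1024 * 1024);

    #pragma omp target teams num_teams(2)
    {
        // Device-side malloc works, but the calls are serialized and draw
        // from the (now enlarged) device heap.
        float *scratch = (float *) malloc(1024 * sizeof(float));
        if (scratch)
        {
            scratch[0] = (float) omp_get_team_num();
            free(scratch);
        }
    }
}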

Now with UM enabled, it does get problematic in that we replace malloc calls with cudaMallocManaged, which can’t be called from the device.

Is the scratch space the same size for all teams? If so, I’d hoist the malloc outside the target region and then add the array to a “private” clause. Even if they aren’t the same size, allocating to the largest size may waste a bit of space, but since the array is private to the team, it likely wouldn’t be much.

Even better, if the size of the scratch array is known at compile time, team private arrays will be placed in shared memory (presuming it fits) which often gives better performance.
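
A minimal sketch of what I’m suggesting, assuming the per-team scratch size is known at compile time (SCRATCH_SIZE here is made up):

#include <omp.h>

#define SCRATCH_SIZE 1024   // made-up per-team scratch size, known at compile time

int main()
{
    float scratch[SCRATCH_SIZE];   // declared once on the host; privatized per team below

    // private(scratch) gives each team its own uninitialized copy of the array;
    // with a compile-time size the compiler may place it in shared memory.
    #pragma omp target teams num_teams(2) private(scratch)
    {
        for (int i = 0; i < SCRATCH_SIZE; i++)
            scratch[i] = 0.0f;     // initialize this team's private copy

        // ... per-team work that reads and writes scratch ...
    }
}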
