Failed to use teams distribute on GPU

Hi,

I would like to use the GPU for parallel computing.
I am using HPC SDK 22.1 as the compiler, along with gcc 7.3.0 built with offload support.
But I failed to use the teams distribute construct.
The compiler message is as follows.

nvvmCompileProgram error 9: NVVM_ERROR_COMPILATION.
Error: /tmp/pgaccemNOmU1qKkHv.gpu (5284, 33): parse use of undefined value '@nvkernel_loglik_q_F1L711_1_F1L713_2'
ptxas /tmp/pgaccemNOmD-xKx_A.ptx, line 1; fatal   : Missing .version directive at start of file '/tmp/pgaccemNOmD-xKx_A.ptx'
ptxas fatal   : Ptx assembly aborted due to errors
NVC++-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (inverseGaussian.c: 1)
NVC++/x86-64 Linux 22.1-0: compilation completed with warnings

An executable file is still produced, but running it results in a core dump.

The compile command is as follows:

nvc -c inverseGaussian.c LowDiscrepancy.o -mp=gpu  -lgomp -lm -Minfo=all -Mcuda -lgf90 -gpu=cuda11.5 -loffload -O3 -acc=gpu -target=gpu -g

Part of the code is:

--- skip ---
#pragma omp target data map(from:Y1[0:n_all],t_mat[0:n_all], method[0:3], \
    mi_vec[0:n], n1[0:1], n2[0:1], seed_set[0:n], pars1[0:n_p],q[0:1],n[0:1]) \
    map(to:fval[0:1],ans_mat[0:(3*n)])
#pragma omp target teams num_teams(9)
{
#pragma omp distribute parallel for simd
for(i=0;i<n;i++){
    printf("[DEBUG] %d omp_get_num_teams()=%d\n",i,omp_get_num_teams());
    int mi=mi_vec[i];
    int iseed;
    double y[m];
    double dt_y[m];
    double ans[3];

    iseed=(int)(seed_set[i] * (double) MAX_MOD);
    ytassign( y, dt_y,  m,  n,  mi, i, Y1, t_mat);
    qmc_int(i,iseed, n1, n2, mi, n_p, q, y, dt_y, pars1, ans, method);
    
    fval+=ans[0];
} // for
}

--- skip ---


By the way, “qmc_int” calls more functions, and I declare them all in a declare target section, like this:
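
#pragma omp declare target
/* prototypes of qmc_int and the routines it calls are listed here */
#pragma omp end declare target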

The program works if the teams and distribute pragmas are commented out.
In that case it is much faster than single-core, but about 2 times slower than multicore.
I think the GPU and the HPC SDK compiler are really helpful.
I hope the distribute feature can further improve the GPU performance.

I've tried every possible combination of target, teams, distribute, and so on.
How can I solve this problem?

Thank you very much.
Hsueh Fang

Hi Hsueh Fang,

Looks like you’re getting an undefined value error for “loglik_q”, though I’m not sure what this symbol is since I don’t see it in the code snippet. Is it a global variable that needs to be added to a declare target section? If so, the fix would look something like this (the declaration below is just a guess at the type):
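
#pragma omp declare target
double loglik_q;    /* hypothetical declaration; actual type/size unknown */
#pragma omp end declare target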

It could also be a compiler error but I would need a minimal reproducing example to investigate. Would you be able to provide a complete example that exhibits the error?

Thanks,
Mat

Nice to see your reply.

The program is a little complex.
I have put more information on GitHub (GitHub - aihsuehfang/gpu_test).

If these pragmas are commented out, then there is no error (i.e., no distribute feature):

#pragma omp target teams num_teams(9)
#pragma omp distribute parallel for simd

Thanks,
Hsueh Fang

Thanks, though the git repo seems to only include a readme file, not the source. Are you able to provide the source as well?

Hi,

Oops I forgot to commit!

Attached file includes all the files I mentioned.

Thank you for your help.

Hsueh Fang


GPU_example_220122.zip (31.4 KB)

Thanks, but the package appears to be incomplete in that there are calls to “baker_trans” and “ytassign” but the source does not include definitions for these routines.

Also, “model_gpu.c” includes a “model_ig.h” header file that is not in the package. However, I was able to work around this by including “function.h” instead, along with adding some standard C header files such as string.h, stdlib.h, and stdio.h.

Finally, I suggest you not link against “-lgomp”. This is the GNU OpenMP runtime library, which will intercept the OpenMP API calls and give you odd runtime behavior when using these calls. Better to use our nvomp library, which is linked in by default when using the “-mp” flag. For example, a trimmed-down compile line (file and output names taken from your command; adjust as needed) would be:
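
nvc -mp=gpu -gpu=cuda11.5 -O3 -Minfo=all inverseGaussian.c LowDiscrepancy.o -o inverseGaussian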

Hi Mat,

Sorry for the incomplete files.
I added the functions you mentioned in functions.c, and included functions.h in model_gpu.h (see the attached file GPU_example_220126.zip).

The problem (core dump) occurs whether or not I compile with the -lgomp flag.

Could a version mismatch cause the problem?
I found that nvidia-smi reports CUDA 11.2 (driver), while nvc is using CUDA 11.5.

Thank you for your help

Hsueh Fang


GPU_example_220126.zip (31.3 KB)

Hi Hsueh Fang,

The core issue is that your code includes VLAs (variable-length arrays) in the device code, which cause implicit allocation and deallocation. For “y” and “dt_y”, the solution is to hoist these arrays out of the target region and then add them to a private clause. For the “sv” array in “qmc_int0”, since this is a subroutine, the only workaround is to explicitly allocate the array rather than rely on implicit allocation.
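
A minimal sketch of the hoisting, using the names from your snippet (assuming “m” is set before the region):

double y[m];        /* hoisted out of the target region */
double dt_y[m];
#pragma omp target teams num_teams(9)
#pragma omp distribute parallel for simd private(y, dt_y)
for(i=0;i<n;i++){
    /* ... loop body as before, now using the private y and dt_y ... */
}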

While calling “malloc” on the device is supported, it is not recommended. Device-side mallocs get serialized, potentially lowering performance. Also, the default heap size on the device is relatively small, so it’s easy to get a heap overflow. While the heap size can be increased by calling “cudaDeviceSetLimit” from the host before the target region, I’d still recommend adjusting the code to use fixed-size arrays, or better yet, hoisting these variables out of the routine and adding them to the parallel for private clause (as I did with y_dt).
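
For reference, raising the device heap from the host looks roughly like this (the 64 MB value is only an example):

#include <cuda_runtime.h>

/* call on the host, before the first target region */
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);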

On a side note, the malloc’d arrays in “qmc_int0” are not being freed at the end of the routine. This will cause a memory leak in your program and can potentially lead to runtime errors.
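
For instance, at the end of “qmc_int0” each allocation needs a matching free:

free(sv);   /* and likewise for any other malloc'd array in the routine */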

The attached code will get you past the initial problem due to the VLAs, but I have not corrected the use of device mallocs. Also, I’m getting an undefined reference error because “SOBOL_F” does not have a device version, so I was unable to link. Having looked at “SOBOL_F”, it will be difficult to offload.

In “INITSOBOL” there are a number of issues. There are several automatic arrays that are implicitly allocated; hence, like using ‘malloc’, they will cause performance and heap-size issues. There are also a few large fixed-size arrays which may cause stack overflows. Finally, the code uses DATA statements, which are not supported on the device since the initialization data is stored statically on the host and is inaccessible on the device. These will need to be converted to runtime initialization. I’m sure there are other issues as well, but this is where I stopped looking.

functions.c (4.0 KB)