Unable to import PGPROF generated profile data

Dear all,
I have a problem using PGPROF to read profile data. My code is written in Fortran 90 with OpenMPI, and I have added some OpenACC directives to it. The CUDA Toolkit I used is CUDA 9.0. Here is the script I used to submit my job:

#!/bin/bash

#PBS -N test7
#PBS -l select=1:ncpus=16:mpiprocs=16:ngpus=4:mem=120GB
#PBS -l walltime=24:00:00
#PBS -A hpce3__guo
#PBS -j oe
##PBS -q prod
#PBS -W group_list=hpce3__guo
#PBS -o log.txt
#ARGS='1380'
QSUB='/galileo/cineca/sysprod/pbs/default/bin/qsub'

###################################################

# fix the SGE environment-handling bug (bash)

#source /usr/share/Modules/init/sh
#export -n -f module

# set path to mpirun

#module add mpi/openmpi-1.5.4-gcc-4.6.1
#module add mpi/openmpi-1.6.5-intel-14.0.2

export PGI_ACC_TIME=1
module load profile/advanced
module load pgi/17.10
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cineca/prod/compilers/pgi/17.10/none/linux86-64/2017/cuda/9.0/lib64
###################################################

# the command to run with mpiexec

DIR=/galileo/home/userexternal/wguo0000/test7_convdiff_optimize_CUDA9.0/
cd $DIR

# the MPI command to run

mpirun pgprof --unified-memory-profiling off -o nvprof.%p.out $DIR/incompact3d
#mpirun -np 16 $DIR/incompact3d >> log.txt
#mpirun pgprof -o nvprof.%p.out $DIR/incompact3d

################################################################################

The code runs well on the GPU and it also generates profile data. But when I try to import the profile data into PGPROF, the following error occurs:
“Unable to import PGPROF generated profile data.
You are trying to import unsupported data. Either use updated Visual Profiler or generate data using compatible version”
I tried different CUDA versions such as 7.5 and 8.0, but they all failed.
I also tested PGPROF using a simple code from the OpenACC lectures, laplace2d.f90. This code doesn't use OpenMPI, and its generated profile data can be imported into PGPROF. So I think maybe PGPROF on my cluster has some problem reading profile data from OpenMPI code.
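For reference, the import step is roughly the following (just a sketch; nvprof.12345.out is a placeholder for one of the per-process output files, and I am assuming pgprof also accepts the nvprof-style import option for a text dump):

# either open the file through the visual profiler's import dialog,
# or dump it as text (assuming the nvprof-style -i/--import-profile option)
pgprof -i nvprof.12345.out
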
Could anybody tell me how to fix this? Thanks in advance!
Best regards,
Wentao Guo

Hi Wentao Guo,

When you ran the simple case successfully, did it run on a remote node of the cluster? In your stderr output in the failing case, was there a message that libcupti.so could not be found?

My best guess is that libcupti.so, which is the NVIDIA profiling runtime library, could not be found and therefore the code wasn’t profiled.

The library can either be found with the CUDA installation (typically installed in /opt/cuda-9.0/lib64) or with the PGI compilers in “$PGI/linux86-64/2017/cuda/9.0/lib”. If these directories are visible on the node, try setting LD_LIBRARY_PATH to one of the locations. If not, please contact your site admin to see about getting them installed on the nodes.
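
For example, something along these lines in the job script (adjust the paths to whatever actually exists on the node):

# point LD_LIBRARY_PATH at whichever copy of libcupti.so is installed on the node
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PGI/linux86-64/2017/cuda/9.0/lib
# or, if the CUDA toolkit itself is installed on the node:
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cuda-9.0/lib64

# quick sanity check that the library is now visible
ls $PGI/linux86-64/2017/cuda/9.0/lib/libcupti.so*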

-Mat

Dear Mat,
Thanks for your reply.

  1. For the simple code, I just compiled it and ran it directly in the shell; the compile command was:
pgf90 -acc -Minfo=accel -o laplace2d.exe laplace2d.f90
  2. In my job script I have already appended the directory that contains libcupti.so to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cineca/prod/compilers/pgi/17.10/none/linux86-64/2017/cuda/9.0/lib64

If I submit the job without the pgprof command, for example:

mpirun -np 16 $DIR/incompact3d >> log.txt

There are no warnings about libcupti.so not being found. The output file looks like this:

Accelerator Kernel Timing data
/galileo/home/userexternal/wguo0000/test7_convdiff_optimize_CUDA9.0/source_code/incompact3d.f90
incompact3d NVIDIA devicenum=0
time(us): 33,924
113: data region reached 2 times
113: data copyin transfers: 49
device time(us): total=993 max=99 min=6 avg=20
244: data copyout transfers: 4
device time(us): total=394 max=99 min=97 avg=98
153: update directive reached 30 times
153: data copyin transfers: 60
device time(us): total=23,535 max=1,579 min=95 avg=392
154: compute region reached 30 times
157: kernel launched 30 times
grid: [4x21x8] block: [32x4]
device time(us): total=836 max=29 min=27 avg=27
elapsed time(us): total=31,538 max=1,686 min=374 avg=1,051
154: data region reached 60 times
163: update directive reached 30 times
163: data copyout transfers: 30
device time(us): total=8,166 max=502 min=109 avg=272

When I submit the job using pgprof, for example:

mpirun pgprof --unified-memory-profiling off -o nvprof.%p.out $DIR/incompact3d

There are still no warnings about libcupti.so not being found. But I found that in the output file the device time after “kernel launched 30 times” is missing.

157: kernel launched 30 times
grid: [4x21x8] block: [32x4]
elapsed time(us): total=24,366 max=1,429 min=162 avg=812

And I got this warning:

PGI: CUDA Performance Tools Interface (CUPTI) could not be initialized.
Please disable all profiling tools (including NVPROF) before using PGI_ACC_TIME.

Could you give me some suggestions?
Best regards,
Wentao Guo

Try disabling PGI_ACC_TIME since the two profilers interfere with each other.
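
In your job script that would be something like:

# make sure the OpenACC runtime profiler is off before launching pgprof
unset PGI_ACC_TIME
# (or simply remove/comment out the "export PGI_ACC_TIME=1" line)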

Dear Mat,
I disabled PGI_ACC_TIME, and $PGI_ACC_TIME is now empty. I submitted the job to the cluster. This time the warning disappeared, but I get the same error message when I try to import the profile data into PGPROF.
Best regards,
Wentao Guo

I am having the same problem.

I get a pgprof.out but I cannot import it into pgprof.

I talked with our profiler folks about this, but we still aren’t positive what’s wrong.

One possibility is that the CUDA version used on the cluster is newer than the version that the pgprof you’re using supports. For example, the cluster may be using a CUDA 9.0 driver while you’re using a PGI 17.4 pgprof, which only supports up to CUDA 8.0.

What CUDA driver is installed on the system, and which version of PGI are you using?
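
You can check both on the node with something like:

# CUDA driver version reported by the driver itself
nvidia-smi
# CUDA driver version as the PGI tools see it
pgaccelinfo | grep -i driver
# PGI compiler/tools version
pgf90 -V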

-Mat

Hello Mat,
I used PGI/17.10 and the CUDA version is CUDA 9.0. I didn’t manage to import the output files, but I did a profiling study using the following command:

mpirun pgprof --unified-memory-profiling off --cpu-profiling-scope instruction  --cpu-profiling-mode flat  $DIR/incompact3d > log.txt

It will generate some profiling information in a text file:

======== CPU profiling result (flat):
Time(%) Time Name
33.38% 85.7227s poll (0xa583e02d)
33.36% 85.6627s epoll_wait (0xa583e033)
0.80% 2.05174s opal_timer_linux_get_cycles_sys_timer (…/…/…/…/…/opal/mca/timer/linux/timer_linux_component.c:42 0xa6484005)
0.71% 1.81154s ompi_coll_libnbc_progress (…/…/…/…/…/ompi/mca/coll/libnbc/coll_libnbc_component.c:267 0xa2b37034)
0.41% 1.04088s ompi_coll_libnbc_progress (…/…/…/…/…/ompi/mca/coll/libnbc/coll_libnbc_component.c:267 0xa2b3703d)
0.39% 1.00085s derx (./derive.f90:75 0x578)
0.39% 990.84ms derx (./derive.f90:75 0x571)
0.35% 900.76ms memmove_ssse3_back (0xa583f9e2)
0.32% 830.7ms derx (./derive.f90:69 0x509)
0.31% 800.68ms decomp_2d_mem_merge_xy_real (./transpose_x_to_y.f90:458 0x143)

Best regards,
Wentao Guo

Hi Wentao Guo,

Do you know the version of the CUDA drivers installed on the cluster? If not, can you please post the output from “nvidia-smi” or “pgaccelinfo” when run on the remote node?

We’re still thinking that it might be a version mismatch in that you’re compiling with CUDA 9.0 but the drivers are older.

If the drivers are older, then you might be able to work around the problem by running “pgprof -ta=8.0 …” if the CUDA Drivers are from CUDA 8.0, or “-ta=7.5” if they are CUDA 7.5.
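
For example, keeping your original launch line and just adding the flag:

mpirun pgprof -ta=8.0 --unified-memory-profiling off -o nvprof.%p.out $DIR/incompact3d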

If the drivers are CUDA 9.0 as well, then I’ll go back to our profiler folks to see if they have other ideas.

-Mat

Dear Mat,
The CUDA driver version is 8.0:

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: “Tesla K80”
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11440 MBytes (11995578368 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit

I tried to use CUDA 7.5 to compile the code:

OPTFC = -acc -Minfo=accel -ta=tesla:cuda7.5 -cpp

But the error still exists. I cannot import the output file to PGPROF.
Best regards,
Wentao Guo

Let’s go back and rerun your text-profile run, but now add “-ta=8.0” to the pgprof command:

mpirun pgprof -ta=8.0 --unified-memory-profiling off --cpu-profiling-scope instruction  --cpu-profiling-mode flat  $DIR/incompact3d > log.txt

Normally pgprof automatically detects the CUDA driver version by calling the PGI “pgaccelinfo” utility, but maybe the “pgaccelinfo” info is getting corrupted? Can you try running “pgaccelinfo” on a node as well?
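
If you can’t get an interactive shell on a compute node, a minimal batch job along these lines should work (just a sketch; the PBS resource settings are placeholders based on your script):

#!/bin/bash
#PBS -l select=1:ngpus=1
#PBS -l walltime=00:05:00
#PBS -j oe
module load profile/advanced
module load pgi/17.10
pgaccelinfo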

-Mat

Dear Mat,
I ran “pgaccelinfo” on a node and here is a part of the output:

CUDA Driver Version: 9000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.81 Sat Sep 2 02:43:11 PDT 2017

Device Number: 0
Device Name: Tesla K80
Device Revision Number: 3.7
Global Memory Size: 11995578368
Number of Multiprocessors: 13
Number of SP Cores: 2496
Number of DP Cores: 832
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 823 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 2505 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc35

So I think maybe the CUDA driver version on the node is 9.0? I then added -ta=9.0 to the pgprof command you mentioned, and I get the same text profiling result as I did before without -ta=9.0.
Best regards,
Wentao Guo

Can you send the resulting profile to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me? I’ll give it to our profiler folks to see if we can find anything in it that might point to what’s wrong.

Thanks!
Mat

Already sent. Thanks for your help!
Best regards,
Wentao Guo