Profiling only partially works

I believe I have the latest version of HPC SDK, and am developing programs for GPU using Ubuntu 18.04 and OpenACC.

If I compile with nvc++ and set PGI_ACC_TIME=1, I get a message

libcupti.so not found

The profile that I do get seems to contain only partial information about execution, copyin, and copyout, making it difficult to tell how much overhead went into data transfer between CPU and GPU. If I compile with tesla:managed, I get even less information.

If I unset that environment variable, I don’t get the error message, but neither do I get any profiling information.

I have also tried running nvprof, but then I get “command not found”. It turns out that it lives under /opt, but even when I run it using the full path, I still get “command not found”.

The result is that I don’t know how to set up the parameters so that I get full profiling without error messages.
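
For reference, the compile-and-run sequence looks roughly like this (the source file name is just a stand-in, and the managed builds simply append :managed to the -ta=tesla target):

% nvc++ -acc -ta=tesla -Minfo=accel -o SiteRepair main.cpp
% setenv PGI_ACC_TIME 1
% ./SiteRepair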

“libcupti.so” is the CUDA Profiling Tools Interface (CUPTI) runtime library and ships with the CUDA toolkit installation (the HPC SDK bundles a copy as well). Try setting your LD_LIBRARY_PATH to include “/opt/cuda-10.2/targets/x86_64-linux/lib/” (adjust the exact path to match your install). Without the library, you’ll only see the host-side timings.
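
For example (csh syntax; first locate the library, then point LD_LIBRARY_PATH at whichever directory it actually lives in):

% find /opt -name 'libcupti.so*'
% setenv LD_LIBRARY_PATH /opt/cuda-10.2/targets/x86_64-linux/lib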

Note that nvprof has been deprecated. You can still obtain it in older CUDA SDK installs, but the HPC SDK only ships with Nsight Systems and Nsight Compute (located under /path/to/hpcsdk_install/Linux_x86_64/20.5/profilers/).
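
If you’d rather have command-line summaries than the GUI, the nsys command in that profilers directory can generate them directly; something roughly like this (./a.out stands in for your executable, and the exact options depend on the nsys version):

% nsys profile -t cuda --stats=true -o report ./a.out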

I did the following:

setenv LD_LIBRARY_PATH /opt/nvidia/hpc_sdk/Linux_x86_64/cuda/11.0/lib64/

Now when I run, I get the following:

PGI: CUDA Performance Tools Interface (CUPTI) could not be initialized.
Please disable all profiling tools (including NVPROF) before using PGI_ACC_TIME.

I was not aware that I was running any other profiling tools.

Here is the ps outcome:

% ps -ef | grep nv
root 361 2 0 Jul12 ? 00:00:00 [nv_queue]
root 362 2 0 Jul12 ? 00:00:00 [nv_queue]
root 363 2 0 Jul12 ? 00:00:00 [nvidia-modeset/]
root 364 2 0 Jul12 ? 00:00:00 [nvidia-modeset/]
root 464 2 0 Jul12 ? 00:00:00 [ext4-rsv-conver]
nvidia-+ 1159 1 0 Jul12 ? 00:00:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
bkeister 1322 23456 0 12:19 pts/3 00:00:00 grep --color nv
root 1504 2 0 Jul12 ? 00:10:54 [irq/64-nvidia]
root 1505 2 0 Jul12 ? 00:00:00 [nvidia]
root 1506 2 0 Jul12 ? 00:14:33 [nv_queue]

My goal is a timing comparison against a version of this code that runs well under OpenMP (it nearly saturates all 8 cores and 16 threads of a Ryzen 7 2700X). If the OpenACC version competes well with that on an entry-level GT 710 card, and if I can tell which pieces might speed up with more CUDA cores (which I have not been able to do so far, because the timing numbers don’t add up to the total elapsed time), then I’m considering buying a more powerful GTX card.

I also just tried running Nsight. Without a simple example to start from, it is so complicated to learn and set all the required parameters (paranoid levels, scheduling, IP samples, etc.) without getting lots of error messages that I would rather have command-line summaries.

What CUDA driver do you have installed? (if you don’t know, you can check by running the ‘nvaccelinfo’ command)

I’m guessing it’s a version mismatch between the CUDA driver you have installed and the CUDA 11 libcupti.so. You’ll want to use the libcupti.so that’s the same CUDA version as your driver.
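
For example, to compare the two (paths based on the LD_LIBRARY_PATH you set earlier):

% nvaccelinfo | grep -i 'cuda driver'
% ls /opt/nvidia/hpc_sdk/Linux_x86_64/cuda/*/lib64/libcupti.so*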

Outcome of nvaccelinfo:

CUDA Driver Version: 10020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.100 Fri May 29 08:45:51 UTC 2020

Device Number: 0
Device Name: GeForce GT 710
Device Revision Number: 3.5

I installed the driver via

sudo apt install nvidia-440

The available libraries are for 10.1, 10.2, 11.0:

% ls /opt/nvidia/hpc_sdk/Linux_x86_64/cuda/
10.1 10.2 11.0

None of them made a difference.

Unfortunately, I’m not sure then. Your driver is CUDA 10.2, so I would think that if my guess was correct and it is a version mismatch, then using the CUDA 10.2 libcupti.so should have worked.

Well, if Nsight Systems is a challenge to get set up, let’s have you fall back to using the nvprof under the cuda/10.2/bin directory. However, since nvprof also uses libcupti.so, you may run into the same problem.
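
Something like this, with the matching library directory on the path (./a.out standing in for your executable):

% setenv LD_LIBRARY_PATH /opt/nvidia/hpc_sdk/Linux_x86_64/cuda/10.2/lib64
% /opt/nvidia/hpc_sdk/Linux_x86_64/cuda/10.2/bin/nvprof ./a.out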

When it says

CUDA Driver Version: 10020

does that mean 10.2 or 10.02, 10.020?

I could try to run this:

NVIDIA…450.57.run

but I’ve read that apt install is a better way to install a driver.

Then there’s this post, which I don’t understand and may not be relevant.

addendum:

I tried running
/opt/nvidia/hpc_sdk/Linux_x86_64/cuda/10.2/bin/nvprof

When I do, I get this:

/opt/nvidia/hpc_sdk/Linux_x86_64/cuda/10.2/bin/nvprof: error while loading shared libraries: libcupti.so.10.2: cannot open shared object file: No such file or directory

But the file exists along the same tree:

/opt/nvidia/hpc_sdk/Linux_x86_64/cuda/10.2/lib64/libcupti.so.10.2

Apparently one hand doesn’t know about the other. Is there an environment variable that forces this?

addendum 2:

I placed a copy of libcupti.so.10.2 in the working directory. I still got the “not found” message, but nvprof ran. I have some questions about the output.

[Note: I worked hard with C++ classes to force the heavy array work to remain on the GPU without copying back and forth (the final results only depend on a sampling of the arrays). I ONLY got it to work under tesla:managed; without the managed keyword, the arrays all ended up zero.]
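
The pattern is roughly this (a stripped-down sketch, not the real SiteRepair classes; the names are stand-ins). With -ta=tesla:managed the member array ends up in unified memory, so the kernels can dereference it on the device without explicit data clauses, and only the values I sample migrate back to the host:

class Grid {                                   // stand-in for the real classes
  double *u;                                   // heavy working array
  int     n;
public:
  explicit Grid(int n_) : u(new double[n_]), n(n_) {}  // allocated as managed memory under -ta=tesla:managed
  ~Grid() { delete[] u; }

  void Increment(double dt) {
    // Runs on the GPU; 'this' is copied implicitly and 'u' points into
    // unified memory, so no copyin/copyout clauses are needed here.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
      u[i] += dt;                              // stand-in for the real integration step
  }

  double Sample(int i) const { return u[i]; }  // only the sampled page migrates back to the host
};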

First, I ran the time command on the executable and got this:

35.025u 2.510s 0:37.89 99.0% 0+0k 0+16io 21385pf+0w

At the end of nvprof, I got this:

==11536== Unified Memory profiling result:
Device “GeForce GT 710 (0)”
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
  366  5.0596KB  4.0000KB  12.000KB  1.808594MB  1.121408ms  Host To Device
43353  35.177KB  4.0000KB  0.9961MB  1.454407GB  462.0970ms  Device To Host
Total CPU Page faults: 21378

This at least suggests that I have minimized the GPU-CPU copying to less than one second out of 37.

However, in the profiling table, I found this:

==11536== Profiling application: SiteRepair --print-gpu-trace
==11536== Profiling result:
Type             Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:   49.97%  17.2702s   5001  3.4533ms  3.4384ms  3.5945ms  EulerIntegrationMethod::Increment_636_gpu(void)
                  27.07%  9.35680s   5001  1.8710ms  1.8447ms  2.0914ms  EulerIntegrationMethod::GetLaplacians_585_gpu(void)
                  11.65%  4.02709s   5001  805.26us  801.83us  810.82us  EulerIntegrationMethod::GetLaplacians_571_gpu(void)
                  10.81%  3.73508s   5001  746.87us  442.64us  749.48us  IntegrationMethod::ManageSites_482_gpu(double)
                   0.48%  166.67ms   5001  33.327us  31.390us  35.902us  EulerIntegrationMethod::Increment_681_gpu(void)
                   0.02%  8.1630ms      1  8.1630ms  8.1630ms  8.1630ms  IntegrationMethod::Initialize_422_gpu(void)
                   0.00%  6.2080us      1  6.2080us  6.2080us  6.2080us  IntegrationMethod::Initialize_460_gpu(void)

…followed by this:

  API calls:   93.39%  34.7281s     25007  1.3887ms  3.5960us  8.1614ms  cuEventSynchronize
                5.65%  2.10217s     25007  84.063us  18.405us  1.6821ms  cuLaunchKernel
                0.33%  122.13ms     50014  2.4410us  1.3520us  243.89us  cuEventRecord
                0.24%  90.436ms         1  90.436ms  90.436ms  90.436ms  cuMemAllocManaged


Since the entire run time was about 37 seconds, the combination of GPU activities plus API calls adds to more than this (almost double). Is there overcounting here?

Bottom line: if I acquired a graphics card with 6x the CUDA cores of what I have now, might I expect roughly a 6x overall speedup in a specific code like this? I don’t know if something is missing from the profiler output associated with the missing library error message.

Each section is reported separately, but they largely overlap one another. The API times are measured on the host while the kernel times are measured on the device; “cuEventSynchronize” is the host blocking while it waits for the device to finish. Also, the unified-memory host-to-device transfers occur while the kernels are running, so they are included in the kernel time.
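
To put numbers on it: in your table the kernel times sum to roughly 17.27 + 9.36 + 4.03 + 3.74 + 0.17 ≈ 34.6 s on the device, and cuEventSynchronize accounts for about 34.7 s on the host, which is just the host waiting on those same kernels. The two figures describe the same interval seen from the two sides, so together with the ~2 s of cuLaunchKernel overhead they are consistent with your ~37 s wall clock rather than double-counting it.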

Is there an environment variable that forces this?
Setting LD_LIBRARY_PATH should enable the dynamic loader to find it. Double check that you have the correct path set. If it’s correct, then unfortunately I’m not sure why it’s not finding it.
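
One way to see exactly what the loader resolves is to run ldd on nvprof itself; if the libcupti line comes back as “not found”, the directory in LD_LIBRARY_PATH isn’t the one being searched:

% ldd /opt/nvidia/hpc_sdk/Linux_x86_64/cuda/10.2/bin/nvprof | grep cupti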

I have now copied libcupti.so.10.2 into my working directory.

With LD_LIBRARY_PATH not set, if I try to run nvprof, I get this:

nvprof: error while loading shared libraries: libcupti.so.10.2: cannot open shared object file: No such file or directory

…and the program exits. If instead I set LD_LIBRARY_PATH=., then nvprof runs, but STILL complains that it cannot find libcupti.so. It would be nice if someone looked into this, as I have little confidence that I’m getting what I’m supposed to get out of nvprof.

Put another way - what do I look for to know that both host and device profiling are present in the log?