PGPROF giving error of incompatible CUDA driver, except when using unified memory with pgc++

Hello!
I am having an issue profiling my OpenACC program without using -ta=tesla:managed. Every time I do, I get this error:

No kernels were profiled.
No API activities were profiled.
======== Error: incompatible CUDA driver version.
when profiling with pgprof or nvprof

But I have CUDA 10.1 and the paths are set, and I have PGI 19.4 and those paths are set too. Also, I am using pgc++. I don’t see how unified memory would work when regular manual memory passing fails with such a strange error…

To clarify, when I compile my program with -ta=tesla:managed, I can run the program and can also profile it. When I compile the same program with -ta=tesla (which is what I want), I can run the program but $pgprof ./ fails with the message above.

The code has a few pragma data copies, but even when I comment those out I get the same error.

Has anyone experienced this issue before with PGI 19.2, CUDA 10.1 and pgc++ with -ta=tesla?

Thanks!

Hi mattstack,

What’s your CUDA driver version? Can you post the output from “pgaccelinfo”?

Without “managed”, we’ll use the CUDA Driver runtime rather than the CUDA runtime, so if your driver is older than one that supports CUDA 10.1, it could cause this problem.

-Mat

Hi Mat,

Here is the output:
CUDA Driver Version: 10020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 430.26 Tue Jun 4 17:40:52 CDT 2019

Device Number: 0
Device Name: GeForce GTX 1070
Device Revision Number: 6.1
Global Memory Size: 8510701568
Number of Multiprocessors: 15
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1683 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 4004 MHz
Memory Bus Width: 256 bits
L2 Cache Size: 2097152 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
PGI Default Target: -ta=tesla:cc60

-Thanks!

Hmm, looks like you have the opposite issue: you have a very new driver, which pgaccelinfo is detecting as CUDA 10.2. Unfortunately, I don’t have a system in house with this driver that I can test on, so I don’t know if these will work, but some possible workarounds are:

  1. Run “pgprof -ta=10.1 …rest of command…”. This is in case pgprof doesn’t recognize the CUDA 10.2 string; the flag tells it to use CUDA 10.1.

  2. Try running with nvprof instead of pgprof. They are mostly the same profiler, except for the default settings and slight version differences, so I would expect nvprof to fail as well, but it’s worth a try. Note you may need to disable OpenMP profiling, i.e. set “--openmp-profiling off”.

  3. Try linking with “-Mcuda”. This should change the linking to be closer to that used with “-ta=tesla:managed”.

If none of these work, I’ll try to get a system with this driver installed and see if I can replicate the issue.
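For anyone reading the pgaccelinfo output above: the "CUDA Driver Version" number is encoded as 1000*major + 10*minor, which is why 10020 is reported as CUDA 10.2. Below is a minimal shell sketch of that decoding, plus a helper that only prints (does not run) the command from workaround 1. The function names and `./a.out` are placeholders for illustration, not part of pgprof.

```shell
# Decode the integer that pgaccelinfo prints as "CUDA Driver Version".
# Encoding: 1000*major + 10*minor, e.g. 10020 -> 10.2, 9010 -> 9.1.
decode_cuda_version() {
    code=$1
    echo "$(( code / 1000 )).$(( code % 1000 / 10 ))"
}

# Print (do not execute) the pgprof command line for workaround 1.
# $1 is the CUDA version to request; ./a.out is a placeholder executable.
pgprof_cmd() {
    echo "pgprof -ta=$1 ./a.out"
}

decode_cuda_version 10020
pgprof_cmd 10.1
```

Printing the command rather than running it keeps the sketch usable on a machine without PGI installed; paste the printed line into your own job script once it looks right.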

The first fix worked! “pgprof -ta=10.1 …rest of command…” works. Thanks!
Just for future reference for anyone else with the problem, only the first fix worked for me, the other two still resulted in the error.

Thanks again Mat!

Hi Mat :)
While Matt (also my student :-)) was able to fix this, I ran into the same issue, and here are my configs. Matt and I are using two different systems.

pgaccelinfo:
CUDA Driver Version: 9010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 390.116 Sun Jan 27 07:21:36 PST 2019

Device Number: 0
Device Name: Tesla K40c
Device Revision Number: 3.5
Global Memory Size: 11996954624
Number of Multiprocessors: 15
Number of SP Cores: 2880
Number of DP Cores: 960
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 745 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 3004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc35

Device Number: 1
Device Name: Tesla K40c
Device Revision Number: 3.5
Global Memory Size: 12799574016
Number of Multiprocessors: 15
Number of SP Cores: 2880
Number of DP Cores: 960
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 745 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 3004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc35

compilation
schandra@cisc372:~/2D-Heat/single_gpu$ pgcc -acc -ta=tesla:cc35 acc_heat_single.c
compiled and executed alright
schandra@cisc372:~/2D-Heat/single_gpu$ srun -p cisc372 --gres=gpu:2 ./a.out 1024 1024 20000 output.dat
srun: job 121030 queued and waiting for resources
srun: job 121030 has been allocated resources
Time for computing: 3.31 s

now trying to use pgprof
schandra@cisc372:~/2D-Heat/single_gpu$ srun -p cisc372 --gres=gpu:2 pgprof -ta=9.1 ./a.out 256 256 2000 output.data
srun: job 121055 queued and waiting for resources
srun: job 121055 has been allocated resources
pgprof-Error-Switch -ta with unknown keyword 9.1
-ta=9.2|cuda9.2|10.0|cuda10.0|10.1|cuda10.1
srun: error: beowulf: task 0: Exited with exit code 1

OR

schandra@cisc372:~/2D-Heat/single_gpu$ srun -p cisc372 --gres=gpu:2 pgprof -ta=tesla,cuda9.1 ./a.out 256 256 2000 output.dat
srun: job 121056 queued and waiting for resources
srun: job 121056 has been allocated resources
pgprof-Error-Switch -ta with unknown keyword tesla,cuda9.1
pgprof-Error-Switch -ta with unknown keyword cuda9.1
-ta=9.2|cuda9.2|10.0|cuda10.0|10.1|cuda10.1
-ta=9.2|cuda9.2|10.0|cuda10.0|10.1|cuda10.1
srun: error: beowulf: task 0: Exited with exit code 1

So I’m not sure what the issue is here. pgaccelinfo gave 9.1 as the version (9010), so that’s what I am using.

Many thanks in advance :-)
Sunita

Hi Sunita,

The version of pgprof that you’re using only recognizes CUDA 9.2, 10.0, and 10.1 drivers, so it doesn’t recognize the “-ta=9.1” flag.

pgprof-Error-Switch -ta with unknown keyword 9.1
-ta=9.2|cuda9.2|10.0|cuda10.0|10.1|cuda10.1

Though the “-ta” flag shouldn’t be needed in most cases, so try running without this flag.

If you’re still not able to get a profile, you may need to use the nvprof that comes with CUDA 9.1 or update your CUDA Driver.
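As a rough sketch of the advice above, the check below tests a version against the list this particular pgprof printed in its error message and falls back to no “-ta” flag when the version isn’t accepted. The list is copied from the error output in this thread and will differ for other PGI releases; the function names and `./a.out` are placeholders.

```shell
# -ta values accepted by the pgprof in this thread, copied from its
# error message. Other PGI releases will accept a different set.
SUPPORTED_TA="9.2 cuda9.2 10.0 cuda10.0 10.1 cuda10.1"

# Succeeds (exit status 0) if $1 is one of the accepted -ta values.
ta_supported() {
    for v in $SUPPORTED_TA; do
        [ "$v" = "$1" ] && return 0
    done
    return 1
}

# Print the command to try: with -ta if the value is accepted,
# otherwise without the flag. ./a.out is a placeholder executable.
profile_cmd() {
    if ta_supported "$1"; then
        echo "pgprof -ta=$1 ./a.out"
    else
        echo "pgprof ./a.out"
    fi
}

profile_cmd 9.1
profile_cmd 10.1
```

With 9.1 this prints the plain command, which matches the suggestion to drop the flag on a CUDA 9.1 driver.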

Hope this helps,
Mat

Thanks very much Mat!

So removing pgprof and -ta and replacing with nvprof worked! :-) I got the text output of the code!!

So should I not use pgprof at all, or is there something that needs fixing, or should I update the CUDA driver to the latest version and try again?

Thanks again!
Sunita

So should I not use pgprof at all, or is there something that needs fixing, or should I update the CUDA driver to the latest version and try again?

CUDA 9.1 is getting a bit old so if you can, you probably want to update your CUDA driver to a more current version.

In PGI 19.3 we stopped shipping CUDA 9.1 with the compilers, hence we needed to remove the “-ta=9.1” option. If you can’t update your driver, you can try setting the environment variable “CUDA_HOME” to a local CUDA 9.1 Toolkit installation, in which case pgprof should be able to find it.

Though, it might be easier to use nvprof instead. pgprof and nvprof are really the same thing, just at different versions, and pgprof enables CPU and OpenACC profiling by default. With nvprof you need to enable these options via command-line flags.
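To make that concrete, here is a small sketch that builds an nvprof command line with CPU and OpenACC profiling switched on. I believe “--cpu-profiling on” and “--openacc-profiling on” are the relevant nvprof options, but treat that as an assumption and confirm with “nvprof --help” for your CUDA version. The function name is a placeholder, and the command is only printed, not run.

```shell
# Print (do not execute) an nvprof command that turns on the CPU and
# OpenACC profiling pgprof enables by default.
# Assumption: these flag names match your nvprof; check `nvprof --help`.
# "$*" is the program to profile followed by its arguments.
nvprof_like_pgprof() {
    echo "nvprof --cpu-profiling on --openacc-profiling on $*"
}

nvprof_like_pgprof ./a.out 256 256 2000 output.dat
```

If nvprof complains about OpenMP profiling, adding “--openmp-profiling off” (mentioned earlier in the thread) is the other flag to try.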

-Mat

Awesome thanks Mat! The forum doesn’t notify me of replies, so I didn’t see this just yet… thanks a lot! I will let the admins know of this as well. It should work.
Best,
Sunita

Thanks a lot, this seems to solve the problem I had when running pgprof with CUDA 10.2 (with the compatibility package) and an old driver (410.79)! I had no idea whether this compatibility package also allows profiling and debugging, so this post was really helpful…

Otherwise, this -ta knob unfortunately does not seem to appear in “pgprof -h”.

Cédric