NVVP does not generate timeline - compiler options??

I am learning to use pgfortran and OpenACC. I built the three examples of “laplace2d” described in the Parallel Forall blog:

https://devblogs.nvidia.com/parallelforall/openacc-example-part-1/
https://devblogs.nvidia.com/parallelforall/openacc-example-part-2/

The three versions are exactly the same program, with OpenACC directives for an NVIDIA GPU added step by step.

All three versions were compiled with the “-acc -ta=nvidia” options and run as described. However, NVVP creates a timeline only for the first version. (Actually, I tried several other examples from the web that run correctly, but version 1 of the laplace2d program is the only one that ever produced a timeline.) According to the Profiler manual, no special compiler options should be required.

In most cases NVVP just crashes the program, with an “unexpected return value -1073741819” or similar. In other cases the program obviously runs (i.e. sends the correct output to the console), but still produces no timeline.
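As a side note, the return value quoted above can be decoded. Reinterpreted as an unsigned 32-bit number it is 0xC0000005, the code Windows uses for an access violation, which would suggest that the profiled process itself is crashing. A minimal sketch of the conversion, assuming a bash-like shell; the variable names are only illustrative:

```shell
# Return value reported by NVVP, as a signed 32-bit integer.
exit_code=-1073741819

# Mask to 32 bits to get the unsigned form used for Windows status codes.
code_hex=$(printf '0x%08X' $(( exit_code & 0xFFFFFFFF )))

echo "$code_hex"   # prints 0xC0000005
```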

This looks very erratic to me.

I should mention that I have PGI Fortran installed (Windows 10), which provides both CUDA 7.0 and CUDA 7.5. I also have CUDA 8.0 installed, but I use 7.5 now, since the most recent driver for 8.0 does not work correctly with my old M2090 card.

I would be very grateful for any help, or a link to a tutorial (I mean one for CUDA beginners - I have 30 years of FORTRAN experience, but am only now starting with GPU programming).

Hi bdick12925,

I tried to recreate this error, but it seems to work for me.

What type of device and CUDA driver version are you using? (you can run the PGI utility “pgaccelinfo” if you don’t know).

If you are using a GT or GTX card on Windows, you’ll be using the Windows Display Driver Model (WDDM) driver. To protect your graphics card from freezing, WDDM has a watchdog timer that will kill long-running device processes. My one thought is that the longer-running second example is getting killed by the watchdog timer, causing your NVVP failure.

There are ways to increase the watchdog timeout, but they involve hacking your Windows registry, so it is not something I can recommend.

The recommended method is to use the Tesla Compute Cluster (TCC) driver. However, this driver is only available for use with the Tesla brand cards on Windows which are made for computation. On Linux, TCC is the default driver.
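To see which driver model and driver version a card is actually using, something like the following could be run. This is only a sketch: it assumes `nvidia-smi` (installed with the NVIDIA driver) is on the PATH and a bash-like shell is available, and the `driver_model.current` field is only meaningful on Windows.

```shell
# Fields to report: GPU name, driver version, and the active driver model
# (WDDM or TCC on Windows).
query="name,driver_version,driver_model.current"

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu="$query" --format=csv || echo "nvidia-smi query failed"
else
    echo "nvidia-smi not found on PATH"
fi
```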

  • Mat

My computer is a Skylake i7-6700K @ 4.00 GHz, 64 GB RAM,
Windows 10, or Ubuntu Linux 16.04.
I am using the on-chip graphics (Intel HD Graphics 530).

As I wrote in my post, my GPU is an M2090 (got this as a spare part, made my own fans etc.). This has no video output.

I built the examples on Windows, since I got the PGI compilers (15-day license) for Windows.

The first version is the slowest (64 sec), i.e., if timing were a problem, this one should not make a timeline - but in fact it is the only one that makes one. Step 2 needs 5.7 sec, step 3 only 3.3 sec.

According to the manual, no special preparation is required for the compilation of an executable that I want to profile, right?

with best regards,
Bernhard

Dear Mat,
I just tried it with the Linux versions of the PGI compilers on the same computer, now booted into Ubuntu 16.04.

All examples compile and run. With the pgprof GUI, they create a timeline. I have installed CUDA 8.0; the driver is 361.77.

So the problem seems to be either with the Windows version of the PGI compilers (do they require other/additional compiler options?), or with the CUDA version (I used CUDA 7.5 there, since the driver installed by CUDA 8.0 does not seem to work with my M2090 card, so I went back to the previous driver and used CUDA 7.5).

best regards,
Bernhard

Hi Bernhard,

“As I wrote in my post, my GPU is an M2090”

Sorry I missed that you said you are using a Tesla card. I’ll assume then that you have a TCC driver installed.

“The first version is the slowest (64 sec), i.e., if timing were a problem, this one should not make a timeline - but in fact it is the only one that makes one. Step 2 needs 5.7 sec, step 3 only 3.3 sec.”

The time in the first example is mostly due to data movement which the watch dog timer does not monitor. Though, I agree that it’s probably not the watch dog timer that’s causing the issue. Unfortunately, I don’t know what’s causing the issue.

Have you tried using PGPROF on Windows instead of NVVP? The 2016 PGPROF is based on nvprof with the addition of CPU performance profiling. However, it does use CUDA 8.0 under the hood, so it may still give you issues.
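Another thing that might be worth trying is to collect the profile from the command line and open it in the GUI afterwards. A rough sketch, assuming the `nvprof` that matches the installed CUDA toolkit is on the PATH; `s2.exe` and `s2.nvprof` are hypothetical file names, and whether this behaves any better with this driver combination is untested:

```shell
# Hypothetical names: the program to profile and the profile output file.
app=./s2.exe
profile=s2.nvprof

if command -v nvprof >/dev/null 2>&1 && [ -x "$app" ]; then
    # -o writes the collected profile to a file; -f overwrites an existing one.
    nvprof -f -o "$profile" "$app"
else
    # Dry run when nvprof or the executable is not available.
    echo "would run: nvprof -f -o $profile $app"
fi
```

The resulting file can then be loaded into NVVP through its import function, which at least separates "the run fails under the profiler" from "the GUI fails to show a timeline".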

  • Mat

Hi Mat,
I think it is an issue with the driver. In the Linux installation, the driver is 361.77, and everything is working. On Windows, the version of the driver that gets installed by CUDA 8.0 is 369.30. I compiled three versions for each of the three “steps” in the laplace2d example:

pgfortran -acc -ta=tesla,cuda8.0 -o s1.exe laplace2d.f90
pgfortran -acc -ta=tesla,cuda7.5 -o s2.exe laplace2d.f90
pgfortran -acc -ta=tesla,cuda7.0 -o s3.exe laplace2d.f90

(The latter two CUDA versions are part of the PGI installation.) None of these programs run with the CUDA 8.0 driver installed (unlike CUDA 8.0 on Linux, but there the driver is 361.77).
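For reference, the full set of nine builds (three steps times three CUDA versions) could be generated with a loop like the one below. It is a dry run that only prints the commands, and the `step1` to `step3` directories and the output names are hypothetical, since the commands above only show one source file name:

```shell
# Dry run: print the nine compile commands (3 steps x 3 CUDA versions).
# Directory names step1..step3 and the .exe names are assumptions.
count=0
for step in 1 2 3; do
    for cuda in 7.0 7.5 8.0; do
        echo "pgfortran -acc -ta=tesla,cuda$cuda -o s${step}_cuda$cuda.exe step$step/laplace2d.f90"
        count=$((count + 1))
    done
done
echo "$count commands"   # prints "9 commands"
```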

I went to the NVIDIA site and searched for the latest driver with the settings for the M2090, and that is 356.54. After I installed that, all 9 programs run OK.

However, NVVP starts the CUDA 8.0 version, and this claims to be incompatible with the installed CUDA driver (i.e. 356.54).

PGPROF, on the other hand, does not complain and runs the program (I get the console output at the end), but PGPROF itself never finishes.
The version that is started is from the cuda/7.5-pgprof/bin directory.

So I could not reproduce the one instance where I actually got a timeline.

I conclude that driver 369.30 is incompatible with the compiled programs, but the profilers don't like the older driver 356.54. So I will probably need the Windows driver that is the equivalent of 361.77 for Linux.
best regards,
Bernhard