No Binary for GPU on P100 when running executable

I am trying to compile a Fortran code with OpenACC directives using PGI 16.10 on a node with a P100 GPU and an IBM POWER8 CPU. I can successfully run this exact code on a Tesla K40 machine, but I had to remove some kernels from the P100 version to get it to compile. It now compiles, but I can't run it on the GPU. Compilation gives no errors, only a few warnings. When I run the executable, I get the following error message:

call to cuModuleLoadData returned error 209: No binary for GPU
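The regions in question are ordinary OpenACC compute constructs; a stripped-down sketch of the kind of loop involved (illustrative only, with hypothetical names; the actual code is much larger) looks like this:

! Illustrative only; hypothetical names, not the actual source.
subroutine update_field(n, a, b)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n)
  real(8), intent(inout) :: b(n)
  integer :: i
  ! simple data-parallel loop offloaded with OpenACC
  !$acc parallel loop copyin(a) copy(b)
  do i = 1, n
     b(i) = b(i) + 2.0d0*a(i)
  end do
  !$acc end parallel loop
end subroutine update_field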

To get more information about what is happening, I run the executable under nvprof and get the following:

call to cuModuleLoadData returned error 209: No binary for GPU
==51080== Profiling application: ./higrad_basic_openacc tests/idexp2d/idexp2d.in
==51080== Profiling result:
No kernels were profiled.

==51080== Unified Memory profiling result:
Total CPU Page faults: 147

==51080== API calls:
Time(%) Time Calls Avg Min Max Name
64.52% 128.29ms 1 128.29ms 128.29ms 128.29ms cuDevicePrimaryCtxRetain
23.19% 46.111ms 1 46.111ms 46.111ms 46.111ms cuDevicePrimaryCtxRelease
11.30% 22.465ms 51 440.49us 5.7560us 20.342ms cuMemAllocManaged
0.60% 1.2029ms 37 32.511us 397ns 110.04us cuMemFree
0.21% 413.45us 1 413.45us 413.45us 413.45us cuMemAllocHost
0.14% 283.42us 1 283.42us 283.42us 283.42us cuMemAlloc
0.01% 25.411us 2 12.705us 9.2560us 16.155us cuModuleLoadData
0.01% 19.000us 1 19.000us 19.000us 19.000us cuStreamCreate
0.00% 8.1520us 1 8.1520us 8.1520us 8.1520us cuCtxSynchronize
0.00% 5.0330us 3 1.6770us 456ns 4.0820us cuDeviceGetCount
0.00% 3.6670us 3 1.2220us 426ns 2.1720us cuCtxSetCurrent
0.00% 3.0540us 8 381ns 259ns 650ns cuDeviceGetAttribute
0.00% 2.3740us 6 395ns 307ns 474ns cuDeviceGet
0.00% 805ns 2 402ns 303ns 502ns cuDeviceComputeCapability
0.00% 601ns 1 601ns 601ns 601ns cuCtxGetDevice

==51080== OpenACC (excl):
Time(%) Time Calls Avg Min Max Name
100.00% 15.724us 1 15.724us 15.724us 15.724us acc_device_init
======== Error: Application returned non-zero code 1

Running pgaccelinfo gives me:

CUDA Driver Version: 8000
NVRM version: NVIDIA UNIX ppc64le Kernel Module 361.107 Sun Nov 6 20:32:15 PST 2016

Device Number: 0
Device Name: Tesla P100-SXM2-16GB
Device Revision Number: 6.0
Global Memory Size: 17071669248
Number of Multiprocessors: 56
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1480 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 715 MHz
Memory Bus Width: 4096 bits
L2 Cache Size: 4194304 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60

Device Number: 1
Device Name: Tesla P100-SXM2-16GB
Device Revision Number: 6.0
Global Memory Size: 17071669248
Number of Multiprocessors: 56
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1480 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 715 MHz
Memory Bus Width: 4096 bits
L2 Cache Size: 4194304 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60


The compiler flags I am using are:

-r8 -Mpreprocess -Mextend -acc -fast -Minfo=accel -ta=tesla:cc60
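
For a single source file, the invocation would look like this (hypothetical file name, shown only to make the flags concrete):

pgfortran -r8 -Mpreprocess -Mextend -acc -fast -Minfo=accel -ta=tesla:cc60 -o myprog myprog.F90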

Any help or insight into what might be going on would be greatly appreciated!

Are you using the -ta=tesla:cc60 option on both the compile and the link steps? If the build process is not too involved, add the -v (verbose) option to your build and post the output.
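
For a multi-file build, that means -acc and -ta=tesla:cc60 should appear on the link line as well, not only on the compile lines; roughly like this (hypothetical file names):

pgfortran -acc -ta=tesla:cc60 -Minfo=accel -fast -c part1.f90
pgfortran -acc -ta=tesla:cc60 -Minfo=accel -fast -c part2.f90
pgfortran -acc -ta=tesla:cc60 -v -o myprog part1.o part2.o

If the link step does not see the OpenACC target flags, the device code embedded in the executable can end up missing or built for the wrong compute capability, which is the kind of situation error 209 ("No binary for GPU") reports.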