Execution error using -ta=tesla:cc35 on K80

I'm trying to run a simple OpenACC program on a K80. I compile it with this command:
pgcc -o test -acc -ta=tesla:cc35 -Minfo=all test.c

The code compiles fine, and -Minfo prints all the relevant compiler optimizations. However, the program fails at run time with this error:

call to cuModuleLoadData returned error 209: No binary for GPU

The code works fine when built with -ta=tesla:cc60, which targets the GTX 1080, and -ta=multicore works as well.
How can I get the code to run with the K80 as the target?
I'm using the pgi/16.10 compiler.

Here is the output from pgaccelinfo:

CUDA Driver Version: 8000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.57 Mon Oct 3 20:37:01 PDT 2016

Device Number: 0
Device Name: GeForce GTX 1080
Device Revision Number: 6.1
Global Memory Size: 8507555840
Number of Multiprocessors: 20
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1733 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 5005 MHz
Memory Bus Width: 256 bits
L2 Cache Size: 2097152 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60

Device Number: 1
Device Name: Tesla K80
Device Revision Number: 3.7
Global Memory Size: 11995578368
Number of Multiprocessors: 13
Number of SP Cores: 2496
Number of DP Cores: 832
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 823 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 2505 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc35

Device Number: 2
Device Name: Tesla K80
Device Revision Number: 3.7
Global Memory Size: 11995578368
Number of Multiprocessors: 13
Number of SP Cores: 2496
Number of DP Cores: 832
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 823 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 2505 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc35

Hi efwright,

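Since the GTX 1080 is the default device (i.e. device 0), the runtime is trying to load your cc35-only binary onto the GTX 1080, which is why you get the "No binary for GPU" error. To run on the K80, you'll need to set the device number to either 1 or 2. You can do this either in the program, by calling the API routine "acc_set_device_num", or at run time, by setting the environment variable "ACC_DEVICE_NUM=1".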

Note that you can also compile with "-ta=tesla:cc35,cc60" (i.e. pgcc -o test -acc -ta=tesla:cc35,cc60 -Minfo=all test.c) so that the binary contains code for both devices. You can then use the environment variable to select which one to run on.
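
If you go the API route, a minimal sketch might look like the following. The helper name select_gpu is made up for illustration, and I'm assuming the runtime numbers the devices the same way pgaccelinfo does (0 = GTX 1080, 1 and 2 = K80). The _OPENACC guard is only there so the file still builds when OpenACC is turned off.

```c
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: ask the OpenACC runtime to use NVIDIA device `wanted`.
 * Returns the device number the runtime reports afterwards, or -1 if the
 * request is out of range or the file was built without OpenACC.
 * Assumption: device numbers match the pgaccelinfo listing. */
int select_gpu(int wanted)
{
#ifdef _OPENACC
    int ndev = acc_get_num_devices(acc_device_nvidia);
    if (wanted < 0 || wanted >= ndev) {
        fprintf(stderr, "device %d not available (%d NVIDIA device(s) found)\n",
                wanted, ndev);
        return -1;
    }
    acc_set_device_num(wanted, acc_device_nvidia);
    return acc_get_device_num(acc_device_nvidia);
#else
    (void)wanted;  /* no OpenACC runtime, so nothing to select */
    return -1;
#endif
}
```

Call select_gpu(1) at the top of main, before the first OpenACC data or compute region, so that everything after it runs on the K80. The environment variable does the same job without a rebuild, which is usually more convenient when you are just switching between cards for testing.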

Hope this helps,
Mat