call to cuModuleLoadData returned error 209

Hello,

I have just started using NVIDIA’s OpenACC Toolkit 2015. I have two GPUs on my system: a Quadro K620 and a Tesla K20c.
I managed to compile and run a simple example (https://www.youtube.com/watch?v=_do2Dwa29EM) on the Quadro K620, but when I target the Tesla I get this error:

pgcc -acc -ta=tesla:cc35 -o laplas2d-acc laplace2d.c
call to cuModuleLoadData returned error 209: No binary for GPU

I believe PGI supports NVIDIA’s Tesla GPUs.
I appreciate any help.

Thanks,
Ali

Hi Ali,

This binary should be fine for the K20 since K20s are compute capability 3.5, but the Quadro K620 is compute capability 5.0 so you’ll either need to change “-ta=tesla:cc35” to “-ta=tesla:cc50”, or simply remove the “cc” sub-option. Without the “cc” sub-option, we create multiple versions of the device code for a variety of compute capabilities. Using a specific “cc” sub-option only creates a single target binary.
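To make the difference concrete, here is a sketch of the compile-line choices (file and output names taken from the transcript above; the comma-separated multi-cc form is my assumption about the sub-option syntax these PGI releases accepted):

```shell
# Single-target binary: device code only for compute capability 3.5 (K20c).
# Running it on the cc50 K620 gives "error 209: No binary for GPU".
pgcc -acc -ta=tesla:cc35 -o laplace2d-acc laplace2d.c

# No "cc" sub-option: device code is generated for several compute
# capabilities, so the same binary can run on the K20c or the K620.
pgcc -acc -ta=tesla -o laplace2d-acc laplace2d.c

# Explicitly listing more than one capability (assumed syntax):
pgcc -acc -ta=tesla:cc35,cc50 -o laplace2d-acc laplace2d.c
```

These are compiler invocations for the PGI toolchain, so they only run where `pgcc` is installed.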

Hope this helps,
Mat

Thanks Mat.

I added “-ta=tesla:cc35” to target the Tesla K20c. I had tried cc50 before without any problem, and the run time was 26 seconds. If I remove the “cc” sub-option, I get the same run time, so I assume that without “cc” it targets the Quadro K620. How can I run it only on the K20c?

Here is “pgaccelinfo” output:

CUDA Driver Version: 7050
NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.07 Fri May 8 17:48:57 PDT 2015

Device Number: 0
Device Name: Tesla K20c
Device Revision Number: 3.5
Global Memory Size: 5032706048
Number of Multiprocessors: 13
Number of SP Cores: 2496
Number of DP Cores: 832
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 705 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 2600 MHz
Memory Bus Width: 320 bits
L2 Cache Size: 1310720 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc35

Device Number: 1
Device Name: Quadro K620
Device Revision Number: 5.0
Global Memory Size: 2146762752
Number of Multiprocessors: 3
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1124 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 900 MHz
Memory Bus Width: 128 bits
L2 Cache Size: 2097152 bytes
Max Threads Per SMP: 2048
Async Engines: 1
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc50

So I assume without cc, it will target Quadro K620.

Not quite. Without “cc”, multiple device binaries are embedded in your executable, and the decision of which one to run is made when you first execute, based on the compute capability of the selected device.

How can I run it only on K20c?

Sorry, but I’m not quite understanding the question. Do you want it to only run on the K20, or are you asking why it’s currently running on the K620?

To exclusively use the K20, you can set the environment variable “ACC_DEVICE_NUM=0” or call the routine “acc_set_device_num” from your program.
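A minimal sketch of the in-program route, assuming the standard OpenACC runtime API from openacc.h (the device number 0 matches the K20c in the pgaccelinfo listing above; the loop body is just a placeholder kernel):

```c
#include <openacc.h>
#include <stdio.h>

int main(void)
{
    /* Pin all OpenACC compute regions to device 0 (the Tesla K20c
       in the pgaccelinfo listing) before the first compute region. */
    acc_set_device_num(0, acc_device_nvidia);

    float a[1000];

    /* Placeholder parallel loop, standing in for the laplace2d kernel. */
    #pragma acc parallel loop copyout(a)
    for (int i = 0; i < 1000; ++i)
        a[i] = 2.0f * i;

    printf("a[999] = %f\n", a[999]);
    return 0;
}
```

This needs an OpenACC-capable compiler (e.g. `pgcc -acc -ta=tesla`). The environment-variable route is equivalent and requires no code change: `export ACC_DEVICE_NUM=0` before running the binary.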

By default, device 0 would be used, so if you’re running on the K620, then you must be setting the device number to 1 someplace.

Mat

I wanted to run the parallel loop exclusively on the K20c, and setting “ACC_DEVICE_NUM=0” did the trick.

Thanks