Error running GPU code as DLL

I have a project where I’ve extracted the code that is to be run on the GPU into a DLL, which is dynamically linked to the main project. To facilitate testing, I’ve written a small sample EXE that I can build either to use the DLL or to statically link the GPU library. The DLL is written and compiled in C, the EXE in C++.

The GPU code runs correctly when statically linked into the sample EXE as either 32-bit or 64-bit, and when dynamically linked as 32-bit.

HOWEVER, I get a runtime error and everything fails when dynamically linked as 64-bit. Here is the error:

The accelerator does not match the profile for which this program was compiled
Current file:     C:\Users\Sentry360\Desktop\360API.GCC\ImageProcessorGPU.c
        function: InnerProcessImageGPU
        line:     167
Current region was compiled for:
  NVIDIA Tesla GPU sm10 sm20 sm30
Available accelerators:
  device[1]: Native X86 (CURRENT DEVICE)

It seems my device 0 is not being found during initialization. The strange part is that pgaccelinfo runs fine, and the other three scenarios run fine as well.

CUDA Driver Version:           6050

Device Number:                 0
Device Name:                   GeForce GTX 560
Device Revision Number:        2.1
Global Memory Size:            1073741824
Number of Multiprocessors:     7
Number of Cores:               224
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1701 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             2052 MHz
Memory Bus Width:              256 bits
L2 Cache Size:                 524288 bytes
Max Threads Per SMP:           1536
Async Engines:                 1
Unified Addressing:            Yes
Current free memory:           818917376
Upload time (4MB):             1040 microseconds ( 690 ms pinned)
Download time:                 1320 microseconds ( 690 ms pinned)
Upload bandwidth:              4032 MB/sec (6078 MB/sec pinned)
Download bandwidth:            3177 MB/sec (6078 MB/sec pinned)
PGI Compiler Option:           -ta=tesla:cc20
clGetDeviceIDs returns code -1
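
Since the failure looks like the runtime not detecting the device, here is a tiny host-only probe I can build alongside the EXE (with the same -acc -ta=nvidia flags) to ask the OpenACC runtime directly what it sees. This is just a sketch: probe.c is an illustrative name, and acc_device_nvidia comes from the PGI openacc.h header.

probe.c

#include <stdio.h>
#include <openacc.h>

int main( void )
{
    /* ask the OpenACC runtime how many NVIDIA devices it can see */
    int ndev = acc_get_num_devices( acc_device_nvidia );
    printf( "NVIDIA devices visible to the OpenACC runtime: %d\n", ndev );
    if( ndev > 0 )
    {
        acc_set_device_num( 0, acc_device_nvidia ); /* 0 = let the runtime pick its default NVIDIA device */
        acc_init( acc_device_nvidia );              /* force runtime initialization */
        printf( "acc_init on the NVIDIA device succeeded\n" );
    }
    return 0;
}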

Any ideas?

I realize my previous post was necessarily vague, so here is a very simple reproducer with source code and compilation commands. The source is taken directly from the Getting Started PDF.

Build and run on Win7 Pro 64-bit.

dll.c

__declspec(dllexport) void __cdecl vecaddgpu( float *restrict r, float *a, float *b, int n )
{
    #pragma acc kernels loop copyin(a[0:n],b[0:n]) copyout(r[0:n])
    for( int i = 0; i < n; ++i ) r[i] = a[i] + b[i];
}

app.c

#include <stdio.h>
#include <stdlib.h>

__declspec(dllimport) void __cdecl vecaddgpu( float *restrict r, float *a, float *b, int n );

int main( int argc, char* argv[] )
{
    int n; /* vector length */
    float * a; /* input vector 1 */
    float * b; /* input vector 2 */
    float * r; /* output vector */
    float * e; /* expected output values */
    int i, errs;
    if( argc > 1 ) n = atoi( argv[1] );
    else n = 100000; /* default vector length */
    if( n <= 0 ) n = 100000;
    a = (float*)malloc( n*sizeof(float) );
    b = (float*)malloc( n*sizeof(float) );
    r = (float*)malloc( n*sizeof(float) );
    e = (float*)malloc( n*sizeof(float) );
    for( i = 0; i < n; ++i )
    {
        a[i] = (float)(i+1);
        b[i] = (float)(1000*i);
    }
    /* compute on the GPU */
    vecaddgpu( r, a, b, n );
    /* compute on the host to compare */
    for( i = 0; i < n; ++i ) e[i] = a[i] + b[i];
    /* compare results */
    errs = 0;
    for( i = 0; i < n; ++i )
    {
        if( r[i] != e[i] )
        {
            ++errs;
        }
    }
    printf("%d errors found\n", errs );
    return errs;
}

Build

64-bit

pgcc -c -fast -O2 -Bstatic -m64 -Minfo -acc -ta=nvidia:cc1+ dll.c
pgcc -Mmakedll -fast -O2 -Bstatic -m64 -Minfo -acc -ta=nvidia:cc1+ -defaultlib:libaccg -defaultlib:libaccg2 -o dll.dll dll.obj
pgcc -fast -O2 -Bstatic -m64 -defaultlib:dll -o app.exe app.c
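
In case it helps narrow things down, the 64-bit binaries can also be inspected with dumpbin (which ships with Visual Studio) to confirm that the export is present and to see which DLLs the EXE actually depends on:

dumpbin /exports dll.dll
dumpbin /dependents app.exe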


32-bit (to prove it works this way)

pgcc -c -fast -O2 -Bstatic -m32 -Minfo -acc -ta=nvidia:cc1+ dll.c
pgcc -Mmakedll -fast -O2 -Bstatic -m32 -Minfo -acc -ta=nvidia:cc1+ -o dll.dll dll.obj
pgcc -fast -O2 -Bstatic -m32 -defaultlib:dll -o app.exe app.c
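
Running the 32-bit build gives the result I expect, since the GPU and host results should match (the optional vector-length argument defaults to 100000):

app.exe
0 errors found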

Here again is the error I receive from the 64-bit build (the pgaccelinfo output is in my first post):

The accelerator does not match the profile for which this program was compiled
Current file:     C:\Users\xxxx\Desktop\test64\dll.c
        function: vecaddgpu
        line:     3
Current region was compiled for:
  NVIDIA Tesla GPU sm10 sm20 sm30
Available accelerators:
  device[1]: Native X86 (CURRENT DEVICE)

Hi Erik,

You’re the first person I know of who has wanted to put OpenACC regions inside a DLL, so it’s not something we’ve put much effort into. I suspect what’s going on is that since DLLs have their own memory space, the runtime isn’t getting initialized properly. Most likely it only works in 32-bit because you happen to be using some default values. If I change your 32-bit compile to target any device other than cc1+, the binary fails in the same manner as the 64-bit build.
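
In the meantime, one thing you could try is having the DLL initialize the OpenACC runtime and select the device explicitly before the first compute region runs. This is just a sketch using the openacc.h API (initgpu is a hypothetical routine name, and I haven’t verified it works around this particular problem):

#include <openacc.h>

/* Hypothetical exported init routine: call it once from the EXE before
   vecaddgpu so the DLL's copy of the OpenACC runtime selects and
   initializes the GPU explicitly. */
__declspec(dllexport) void __cdecl initgpu( void )
{
    acc_set_device_num( 0, acc_device_nvidia ); /* 0 = default NVIDIA device */
    acc_init( acc_device_nvidia );
}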

I’ve filed an RFE (TPR#20827), and hopefully we can do something to help you.

Thanks,
Mat