Error running GPU code as DLL

I have a project where I’ve extracted the code that is to be run on the GPU into a DLL, which is dynamically linked to the main project. To facilitate testing, I’ve written a small sample EXE that I can build either to use the DLL or to statically link the GPU library. The DLL is written and compiled in C, the EXE in C++.

The GPU code runs correctly when statically linked into the sample EXE as either 32-bit or 64-bit, and when dynamically linked as 32-bit.

HOWEVER, I get a runtime error and everything fails when dynamically linked as 64-bit. Here is the error:

The accelerator does not match the profile for which this program was compiled
Current file:     C:\Users\Sentry360\Desktop\360API.GCC\ImageProcessorGPU.c
        function: InnerProcessImageGPU
        line:     167
Current region was compiled for:
  NVIDIA Tesla GPU sm10 sm20 sm30
Available accelerators:
  device[1]: Native X86 (CURRENT DEVICE)

It seems my device 0 is not being found during initialization. The strange part is that pgaccelinfo runs fine, and the other three scenarios run fine as well.

CUDA Driver Version:           6050

Device Number:                 0
Device Name:                   GeForce GTX 560
Device Revision Number:        2.1
Global Memory Size:            1073741824
Number of Multiprocessors:     7
Number of Cores:               224
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1701 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             2052 MHz
Memory Bus Width:              256 bits
L2 Cache Size:                 524288 bytes
Max Threads Per SMP:           1536
Async Engines:                 1
Unified Addressing:            Yes
Current free memory:           818917376
Upload time (4MB):             1040 microseconds ( 690 ms pinned)
Download time:                 1320 microseconds ( 690 ms pinned)
Upload bandwidth:              4032 MB/sec (6078 MB/sec pinned)
Download bandwidth:            3177 MB/sec (6078 MB/sec pinned)
PGI Compiler Option:           -ta=tesla:cc20
clGetDeviceIDs returns code -1
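
Since the failure looks like the runtime not detecting the device, here is a tiny host-only probe I can build alongside the EXE (with the same -acc -ta=nvidia flags) to ask the OpenACC runtime directly what it sees. This is just a sketch: probe.c is an illustrative name, and acc_device_nvidia comes from the PGI openacc.h header.

probe.c

#include <stdio.h>
#include <openacc.h>

int main( void )
{
    /* ask the OpenACC runtime how many NVIDIA devices it can see */
    int ndev = acc_get_num_devices( acc_device_nvidia );
    printf( "NVIDIA devices visible to the OpenACC runtime: %d\n", ndev );
    if( ndev > 0 )
    {
        acc_set_device_num( 0, acc_device_nvidia ); /* 0 = let the runtime pick its default NVIDIA device */
        acc_init( acc_device_nvidia );              /* force runtime initialization */
        printf( "acc_init on the NVIDIA device succeeded\n" );
    }
    return 0;
}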

Any ideas?

I realize my previous post was necessarily vague, so here is a very simple reproducer with source code and compilation commands. The source is taken directly from the Getting Started PDF.

Build and run on Win7 Pro 64-bit.

dll.c

__declspec(dllexport) void __cdecl vecaddgpu( float *restrict r, float *a, float *b, int n )
{
    #pragma acc kernels loop copyin(a[0:n],b[0:n]) copyout(r[0:n])
    for( int i = 0; i < n; ++i ) r[i] = a[i] + b[i];
}

app.c

#include <stdio.h>
#include <stdlib.h>

__declspec(dllimport) void __cdecl vecaddgpu( float *restrict r, float *a, float *b, int n );

int main( int argc, char* argv[] )
{
    int n; /* vector length */
    float * a; /* input vector 1 */
    float * b; /* input vector 2 */
    float * r; /* output vector */
    float * e; /* expected output values */
    int i, errs;
    if( argc > 1 ) n = atoi( argv[1] );
    else n = 100000; /* default vector length */
    if( n <= 0 ) n = 100000;
    a = (float*)malloc( n*sizeof(float) );
    b = (float*)malloc( n*sizeof(float) );
    r = (float*)malloc( n*sizeof(float) );
    e = (float*)malloc( n*sizeof(float) );
    for( i = 0; i < n; ++i )
    {
        a[i] = (float)(i+1);
        b[i] = (float)(1000*i);
    }
    /* compute on the GPU */
    vecaddgpu( r, a, b, n );
    /* compute on the host to compare */
    for( i = 0; i < n; ++i ) e[i] = a[i] + b[i];
    /* compare results */
    errs = 0;
    for( i = 0; i < n; ++i )
    {
        if( r[i] != e[i] )
        {
            ++errs;
        }
    }
    printf("%d errors found\n", errs );
    return errs;
}

Build

64-bit

pgcc -c -fast -O2 -Bstatic -m64 -Minfo -acc -ta=nvidia:cc1+ dll.c
pgcc -Mmakedll -fast -O2 -Bstatic -m64 -Minfo -acc -ta=nvidia:cc1+ -defaultlib:libaccg -defaultlib:libaccg2 -o dll.dll dll.obj
pgcc -fast -O2 -Bstatic -m64 -defaultlib:dll -o app.exe app.c
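
In case it helps narrow things down, the 64-bit binaries can also be inspected with dumpbin (which ships with Visual Studio) to confirm that the export is present and to see which DLLs the EXE actually depends on:

dumpbin /exports dll.dll
dumpbin /dependents app.exe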


32-bit (to prove it works this way)

pgcc -c -fast -O2 -Bstatic -m32 -Minfo -acc -ta=nvidia:cc1+ dll.c
pgcc -Mmakedll -fast -O2 -Bstatic -m32 -Minfo -acc -ta=nvidia:cc1+ -o dll.dll dll.obj
pgcc -fast -O2 -Bstatic -m32 -defaultlib:dll -o app.exe app.c
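
Running the 32-bit build gives the result I expect, since the GPU and host results should match (the optional vector-length argument defaults to 100000):

app.exe
0 errors found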

Here again is the error I receive from the 64-bit build (the pgaccelinfo output is in my first post):

The accelerator does not match the profile for which this program was compiled
Current file:     C:\Users\xxxx\Desktop\test64\dll.c
        function: vecaddgpu
        line:     3
Current region was compiled for:
  NVIDIA Tesla GPU sm10 sm20 sm30
Available accelerators:
  device[1]: Native X86 (CURRENT DEVICE)

Hi Erik,

You’re the first person I know of who has wanted to put OpenACC regions inside a DLL, so it’s not something we’ve put much effort into. I suspect what’s going on is that since DLLs have their own memory space, the runtime isn’t getting initialized properly. Most likely it only works in 32-bit because you happen to be using some default values. If I change your 32-bit compile to target any device other than cc1+, the binary fails in the same manner as the 64-bit build.
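
In the meantime, one thing you could try is having the DLL initialize the OpenACC runtime and select the device explicitly before the first compute region runs. This is just a sketch using the openacc.h API (initgpu is a hypothetical routine name, and I haven’t verified it works around this particular problem):

#include <openacc.h>

/* Hypothetical exported init routine: call it once from the EXE before
   vecaddgpu so the DLL's copy of the OpenACC runtime selects and
   initializes the GPU explicitly. */
__declspec(dllexport) void __cdecl initgpu( void )
{
    acc_set_device_num( 0, acc_device_nvidia ); /* 0 = default NVIDIA device */
    acc_init( acc_device_nvidia );
}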

I’ve filed an RFE (TPR#20827), and hopefully we can do something to help you.

Thanks,
Mat