cuModuleLoadData error 209

Hello,

I installed pgcc 14.7 and can compile the OpenACC samples with no errors using -ta=tesla,cc1x. When I run the binary, I get:

call to cuModuleLoadData returned error 209: No binary for GPU

pgaccelinfo shows my GPU, a GeForce 9800, and its output is:

CUDA Driver Version: 6050
NVRM version: NVIDIA UNIX x86_64 Kernel Module 340.24 Wed Jul 2 14:24:20 PDT 2014

Device Number: 0
Device Name: GeForce 9800 GT
Device Revision Number: 1.1
Global Memory Size: 1073414144
Number of Multiprocessors: 14
Number of Cores: 112
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 8192
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 2147483647B
Texture Alignment: 256B
Clock Rate: 1500 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: No
ECC Enabled: No
Memory Clock Rate: 900 MHz
Memory Bus Width: 256 bits
Max Threads Per SMP: 768
Async Engines: 1
Unified Addressing: No
Initialization time: 412790 microseconds
Current free memory: 1034262272
Upload time (4MB): 1761 microseconds (1497 ms pinned)
Download time: 1854 microseconds (1243 ms pinned)
Upload bandwidth: 2381 MB/sec (2801 MB/sec pinned)
Download bandwidth: 2262 MB/sec (3374 MB/sec pinned)
PGI Compiler Option: -ta=tesla:cc11

I am experimenting with a sample found on "PGI Compilers with OpenACC" (https://developer.nvidia.com/content/cudacasts-episode-3). The code I downloaded from NVIDIA is:

/*
 *  Copyright 2012 NVIDIA Corporation
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */

#include <math.h>
#include <stdio.h>
#include <string.h>
#include "timer.h"

#define NN 4096
#define NM 4096

double A[NN][NM];
double Anew[NN][NM];

int main(int argc, char** argv)
{
    const int n = NN;
    const int m = NM;
    const int iter_max = 1000;
    
    const double tol = 1.0e-6;
    double error     = 1.0;
    
    memset(A, 0, n * m * sizeof(double));
    memset(Anew, 0, n * m * sizeof(double));
        
    for (int j = 0; j < n; j++)
    {
        A[j][0]    = 1.0;
        Anew[j][0] = 1.0;
    }
    
    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
    
    StartTimer();
    int iter = 0;
    
#pragma acc data copy(A), create(Anew)
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        
#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];    
            }
        }

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        
        iter++;
    }

    double runtime = GetTimer();
 
    printf(" total: %f s\n", runtime / 1000);
}

-Minfo=accel shows the following:

50, Generating copy(A[:][:])
    Generating create(Anew[:][:])
56, Generating Tesla code
57, Loop is parallelizable
59, Loop is parallelizable
    Accelerator kernel generated
    57, #pragma acc loop gang /* blockIdx.y */
    59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
63, Max reduction generated for error
68, Generating Tesla code
69, Loop is parallelizable
71, Loop is parallelizable
    Accelerator kernel generated
    69, #pragma acc loop gang /* blockIdx.y */
    71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

After searching the forum, I found that most cuModuleLoadData failures reported here are error 300, which is caused by a compute capability mismatch. That is not my case.

I also installed the CUDA SDK from NVIDIA, and the samples included in the SDK run correctly.

Did I miss a step in setting up the PGI compiler?

Thanks,
Xing Fu

The above sample runs correctly without acceleration. The output is:

Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 523.140407 s

Hi Xing Fu,

You have a double precision variable in your compute region. Double precision isn’t supported on your device (it was first added with CC 1.3). Try changing all of your “double” variables to “float”.

Hope this helps,
Mat

Mat, thanks for your suggestion. I changed the source code to:

#include <math.h>
#include <stdio.h>
#include <string.h>
#include "timer.h"

#define NN 4096
#define NM 4096

float A[NN][NM];
float Anew[NN][NM];

int main(int argc, char** argv)
{
    const int n = NN;
    const int m = NM;
    const int iter_max = 1000;
    
    const float tol = 1.0e-6;
    float error     = 1.0;
    
    memset(A, 0, n * m * sizeof(float));
    memset(Anew, 0, n * m * sizeof(float));
        
    for (int j = 0; j < n; j++)
    {
        A[j][0]    = 1.0;
        Anew[j][0] = 1.0;
    }
    
    printf("Jacobi relaxation Calculation: %d x %d mesh\n", n, m);
    
    StartTimer();
    int iter = 0;
    
#pragma acc data copy(A), create(Anew)
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }
        
#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];    
            }
        }

        if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        
        iter++;
    }

    float runtime = GetTimer();
 
    printf(" total: %f s\n", runtime / 1000);
}

I get the same error. I did further experiments and am posting the output of the following command; maybe it can shed some light.

pgcc laplace2d.c -o laplace2d_acc -ta=tesla,cc1x -Minfo=accel -dryrun



Reading rcfile /opt/pgi/linux86-64/14.7/bin/.pgccrc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/nativerc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/fnativerc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/ccrc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/x86rc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/x8664rc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/lin86rc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/lin8664rc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/iparc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/acc1rc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/acclin8664rc
Reading rcfile /opt/pgi/linux86-64/14.7/bin/localrc
Skipping localrc.XingFu (not found)
Skipping siterc (not found)
Skipping siterc.XingFu (not found)
Skipping .mypgirc (not found)
Skipping .mypgccrc (not found)
Export PGI=/opt/pgi

/opt/pgi/linux86-64/14.7/bin/pgc laplace2d.c -opt 2 -terse 1 -inform warn -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -x 117 0x1000 -quad -x 59 4 -tp istanbul -x 120 0x1000 -astype 0 -stdinc /opt/pgi/linux86-64/14.7/include-gcc48:/opt/pgi/linux86-64/14.7/include:/usr/lib/gcc/x86_64-linux-gnu/4.8/include:/usr/local/include:/usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed:/usr/include/x86_64-linux-gnu:/usr/include -def unix -def __unix -def unix -def linux -def __linux -def linux -def __NO_MATH_INLINES -def __x86_64 -def x86_64 -def LONG_MAX=9223372036854775807L -def 'SIZE_TYPE=unsigned long int' -def 'PTRDIFF_TYPE=long int' -def __THROW= -def extension= -def amd_64__amd64 -def __k8 -def k8 -def SSE -def MMX -def SSE2 -def SSE3 -def SSE4A -def ABM -def STDC_HOSTED -predicate '#machine(x86_64) #lint(off) #system(posix) #cpu(x86_64)' -def _ACCEL=201003 -def _OPENACC=201111 -def PGI_TESLA_TARGET -x 123 0x80000000 -x 123 4 -x 119 0x20 -def __pgnu_vsn=40801 -alwaysinline /opt/pgi/linux86-64/14.7/lib/libintrinsics.il 4 -x 120 0x200000 -x 70 0x40000000 -accel tesla -x 163 0x1 -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -cudaver 5.5 -x 121 0x800 -x 176 0x100 -x 186 0x10000 -x 176 0x100 -x 189 0x800 -x 194 0x01 -x 194 0x02 -y 70 0x40000000 -x 189 0x8000 -y 189 0x4000000 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -asm laplace2d.s

/usr/bin/as laplace2d.s -o laplace2d.o

/opt/pgi/linux86-64/14.7/bin/pgacclnk -nvidia /opt/pgi/linux86-64/14.7/bin/pgnvd -cuda5.5 -computecap=1x /usr/bin/ld /usr/lib/x86_64-linux-gnu/crt1.o /usr/lib/x86_64-linux-gnu/crti.o /opt/pgi/linux86-64/14.7/lib/trace_init.o /usr/lib/gcc/x86_64-linux-gnu/4.8/crtbegin.o /opt/pgi/linux86-64/14.7/lib/initmp.o -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /opt/pgi/linux86-64/14.7/lib/pgi.ld -L/opt/pgi/linux86-64/14.7/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-linux-gnu/4.8 laplace2d.o -rpath /opt/pgi/linux86-64/14.7/lib -rpath /opt/pgi/linux86-64/14.7/libso -rpath /opt/pgi/linux86-64/2014/cuda/5.5/lib64 -o laplace2d_acc -laccapi -laccg -laccn -laccg2 -ldl -lpgmp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc /usr/lib/gcc/x86_64-linux-gnu/4.8/crtend.o /usr/lib/x86_64-linux-gnu/crtn.o

I will post more info once I have installed the bits on another system to see what happens.

I installed the Windows version on a notebook with a Quadro 1000M, which supports -ta=tesla:cc20. It works very well.

Great, glad you got it to work on the CC2.0 card.

I forgot to mention before that C floating-point constants are double precision by default, so you either need to decorate them with an "f" suffix or add the "-Mfcon" flag to make them single precision.

That said, moving to a CC 2.0 device is the best choice given the age of, and features missing from, a CC 1.1 device.

  • Mat

Same error here!!!

call to cuModuleLoadData returned error 209: No binary for GPU

My output from “pgaccelinfo” is:

CUDA Driver Version:           6050
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  343.19  Thu Sep  4 22:43:36 PDT 2014

Device Number:                 0
Device Name:                   GeForce GTX 970
Device Revision Number:        5.2
Global Memory Size:            4294246400
Number of Multiprocessors:     13
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1177 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             3505 MHz
Memory Bus Width:              256 bits
L2 Cache Size:                 1835008 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
Initialization time:           4419 microseconds
Current free memory:           3630735360
Upload time (4MB):              642 microseconds ( 390 ms pinned)
Download time:                  793 microseconds ( 349 ms pinned)
Upload bandwidth:              6533 MB/sec (10754 MB/sec pinned)
Download bandwidth:            5289 MB/sec (12018 MB/sec pinned)

To compile I used:

pgcc -acc -Minfo=accel -o laplace2d_acc laplace2d.c

I did change the variables from double to float, but the error remains the same!

Hi cvelez99,

Your issue is that we don't support the Maxwell (cc5.x) architecture, since it is currently available only on GTX graphics devices. PGI products target the Tesla product line, which is aimed at the computing market. In some cases the Tesla and GTX product lines share the same architecture, but not in Maxwell's case.

Sorry,
Mat