call to cuModuleLoadData returned error 209: No binary for GPU

Hello,

I installed PGI 14.10 and CUDA 5.5 on x86-64 CentOS with two NVIDIA GeForce GTX 295 cards, and compiled the sample acc_c1.c:

pgcc -fast -Minfo -acc -ta=tesla acc_c1.c -o acc_c1.exe

main:
     34, Loop not fused: function call before adjacent loop
         Generated 3 alternate versions of the loop
         Generated vector sse code for the loop
     36, Generating present_or_copyout(r[:n])
         Generating present_or_copyin(a[:n])
         Generating Tesla code
     37, Loop is parallelizable
         Accelerator kernel generated
         37, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     39, Loop not fused: dependence chain to sibling loop
         Loop not vectorized: data dependency
         Loop unrolled 16 times
         Generated 1 prefetches in scalar loop
     41, Loop not fused: function call before adjacent loop

But then I got the following error when running the program:

call to cuModuleLoadData returned error 209: No binary for GPU

Any ideas?

This is the code of sample acc_c1.c

/* 
 *     Copyright (c) 2014, NVIDIA CORPORATION.  All rights reserved.
 *
 * NVIDIA CORPORATION and its licensors retain all intellectual property
 * and proprietary rights in and to this software, related documentation
 * and any modifications thereto.  Any use, reproduction, disclosure or
 * distribution of this software and related documentation without an express
 * license agreement from NVIDIA CORPORATION is strictly prohibited.
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main( int argc, char* argv[] )
{
    int n;      /* size of the vector */
    float *a;  /* the vector */
    float *restrict r;  /* the results */
    float *e;  /* expected results */
    int i, nerrors;
    nerrors = 0;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100000;
    if( n <= 0 ) n = 100000;

    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));
    /* initialize */
    for( i = 0; i < n; ++i ) a[i] = (float)(i+1);

    #pragma acc kernels loop
    for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
    /* compute on the host to compare */
    for( i = 0; i < n; ++i ) e[i] = a[i]*2.0f;
    /* check the results */
    for( i = 0; i < n; ++i ) {
        if ( r[i] != e[i] ) {
           nerrors++;
        }
    }
    printf( "%d iterations completed\n", n );
    if ( nerrors != 0 ) {
        printf( "Test FAILED\n");
    } else {
        printf( "Test PASSED\n");
    }
    
    return 0;
}

Hi jcastro9999,

A GTX 295 is a compute capability 1.3 (CC13) device. By default, PGI 14.10 only creates device binaries for CC20 and CC30, so the runtime finds no code it can load on your card, which is what error 209 reports. For CC13, please compile with "-ta=tesla:cc13".
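
Applied to the command line from the original post, that would be "pgcc -fast -Minfo -acc -ta=tesla:cc13 acc_c1.c -o acc_c1.exe". As a rough sketch of how the compute capability maps onto the target flag: assuming the ccXY suffix is simply the major and minor digits (as in cc13, cc20 and cc30 above), a small C helper could build the flag string. The name pgi_ta_flag is hypothetical and not part of PGI or CUDA, and the helper does not check whether a given PGI release actually supports the resulting target.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not a PGI or CUDA API): writes "-ta=tesla:ccXY"
 * into buf for compute capability X.Y, e.g. 1.3 -> "-ta=tesla:cc13".
 * Assumes the ccXY suffix is just the major and minor digits.
 * Returns 0 on success, -1 if the inputs are out of range or buf is
 * too small to hold the whole flag. */
int pgi_ta_flag(int major, int minor, char *buf, size_t len)
{
    int written;

    assert(buf != NULL);
    if (major < 1 || major > 9 || minor < 0 || minor > 9)
        return -1;

    written = snprintf(buf, len, "-ta=tesla:cc%d%d", major, minor);
    if (written < 0 || (size_t)written >= len)
        return -1;   /* encoding error or truncated output */
    return 0;
}
```

Calling pgi_ta_flag(1, 3, buf, sizeof buf) with a 32-byte buffer should leave "-ta=tesla:cc13" in buf, which is the flag needed for the GTX 295.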

Note that you may also need to install PGI 14.7 since that was the last release to include the CUDA 5.5 tool chain.

Hope this helps,
Mat

Thanks for the quick response, now it works!