Help!! I can't get my NVidia GeForce GT 525M to load in a single CUDA PTX kernel!!

Try as I might, I just can’t get my NVidia GeForce GT 525M to load in a single CUDA PTX kernel using the CUDA Driver API!!

How and where can I get professional support for this problem?

Every time my 32-bit Windows program attempts to load an ASCII, zero-terminated PTX kernel into the GPU using the CUDA Driver API function cuModuleLoadDataEx(), the call returns CUDA_ERROR_NO_BINARY_FOR_GPU.

It’s not a linkage problem, because previous calls to the CUDA Driver API functions, cuInit(), cuDeviceGetCount(), cuDeviceGet(), cuDeviceComputeCapability(), cuDeviceGetAttribute(), and cuCtxCreate() all work great (they return no error).

I originally thought that I must be doing something wrong, but I’ve checked my program code, re-checked it, then re-checked it again, and again, and again, and… well, you get the idea. This problem is nine months old, and I’ve tried every permutation I can think of.

The parameters I’m using (as of this writing) to load in the kernel are:

CU_JIT_MAX_REGISTERS = 32
CU_JIT_INFO_LOG_BUFFER = new char [1024]
CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES = 1024
CU_JIT_ERROR_LOG_BUFFER = new char [1024]
CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES = 1024
CU_JIT_TARGET = CU_TARGET_COMPUTE_21
CU_JIT_FALLBACK_STRATEGY = CU_PREFER_PTX

Nine months ago, I originally posted the problem here:

CUDA_ERROR_NO_BINARY_FOR_GPU loading PTX ? Can't load PTX 'image' no matter what I do.. - CUDA Programming and Performance - NVIDIA Developer Forums

and got one reply, but no answers.

I really need an answer to this, because being able to finish and test the software on this relatively tiny CUDA-based GPU is basically the sole determinant for the feasibility of buying what will probably be a very expensive system, replete with multiple high-end Tesla cards. In short, how can I shell out $12,000 or more for that kind of hardware, when I don’t even have the software to use it?

So is there any billable support mechanism for NVidia CUDA programming, where I can get a definitive answer to this dilemma?

If it helps, here are more specifics on this seemingly insurmountable problem:

Machine: Dell Inspiron N7110
Chipset: GeForce GT 525M, using NVidia Optimus (Is Optimus the problem?)
OS: Windows 7 64-bit Professional, latest build

Of course, I can provide lots more specifics on the problem, but I think it best not to get ahead of myself here…

Can anyone offer any usable advice as to what may be wrong, or who/where I can pay for an answer?

Thanks in advance.

I have seen this error before when I passed invalid PTX code to cuModuleLoadDataEx. But it also looks like you may be specifying the options wrong.

If you post the actual API code you are running we might be able to help more.

The LAMMPS/GPU package works this way, the source code is public and you can download it from

http://lammps.sandia.gov/tars/lammps-14Oct12.tar.gz

for example. I think the main file you would need to look at is

LAMMPS/lib/gpu/geryon/nvd_kernel.h

You may be able to distil all the details for PTX load from this, I’m not sure. Since you’re running on Windows you wouldn’t actually be able to build LAMMPS to be sure that the steps are sufficient in your environment.

If you’re handing in a PTX file with a compatible .target then why not use CU_JIT_TARGET_FROM_CUCONTEXT instead of explicitly stating a compute target?

Also, is your PTX hand-written? Are you sure it’s ASCII?
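
One quick way to rule out the encoding question is to scan the buffer before handing it to the driver. The following is only a sketch in standard C++; `IsPlainAsciiPtx` is a hypothetical helper, not part of the CUDA Driver API. It accepts a buffer only when every byte is 7-bit ASCII and the final byte is the terminating NUL, so UTF-16 text or a UTF-8 byte-order mark would fail it.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the CUDA Driver API.
// Returns true when buf[0..size) holds only 7-bit ASCII text and its last
// byte is the terminating NUL, with no NUL embedded earlier.
bool IsPlainAsciiPtx(const char *buf, std::size_t size)
{
    if (buf == nullptr || size == 0 || buf[size - 1] != '\0')
        return false;                       // missing terminator
    for (std::size_t i = 0; i + 1 < size; ++i) {
        const unsigned char c = static_cast<unsigned char>(buf[i]);
        if (c == 0 || c > 0x7F)
            return false;                   // embedded NUL (e.g. UTF-16) or non-ASCII byte
    }
    return true;
}
```

For a static array such as the kernel text in this thread, the call would look like `IsPlainAsciiPtx( MiniKern, sizeof( MiniKern ) )`.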

For what it’s worth, I’ve had good results across pretty much all architectures and drivers loading cubins in fatbins via cuModuleLoadFatBinary(). Unfortunately, I never tried embedding PTX. :|

Many thanks to all three of you for your replies.

Mark: Well, I knew that if anyone replied (and that was far from certain) that at least one of them would ask to see the source code (it’s what I would do). So, because the source code itself includes a whole mess of stuff irrelevant to the problem, I created the following, which is essentially what the program does, without the other, entirely superfluous code:

#include "stdafx.h"
#include "CUDA.h"       // <== includes: #include "cuda_drvapi_dynlink_cuda.h"

extern "C" {
#include "cuda_drvapi_dynlink.c"    // <== ..to dynamically link in the 4.0 CUDA API..
}

/***************
* CONFIGURATION *
***************/

#define PTXINFLOGSZE    1024    // <== Size, in bytes, of the buffer to hold informational log messages from the PTX assembly..
#define PTXERRLOGSZE    1024    // <== Size, in bytes, of the buffer to hold any error log messages from the PTX assembly....

/**************
* ENUMERATIONS *
**************/

enum    // These are the named indices of the LoadOpts[] array, and as such, MUST exactly match the number and order of the elements of that array..
{
    OptNdx_MAX_REGS,
//  OptNdx_BLK_THREADS,
    OptNdx_INFO_BUFF,
    OptNdx_INFO_BUFF_SZE,
    OptNdx_ERR_BUFF,
    OptNdx_ERR_BUFF_SZE,
    OptNdx_TARGET,
    OptNdx_FALLBACK
};

/**************
* LOCAL MACROS *
**************/

#define Elems( array )  (sizeof(array)/sizeof(*(array)))

#define CALL_DRVR_API( func, call ) \
if ( (Rslt = (call)) != CUDA_SUCCESS ) GenError( Caller, ERR_CUDACALL, #func, Rslt )

#define FREE_ARRAY_PTR( A ) \
if ( A ) { delete [] A; A = NULL; }

/*****************
* LOCAL CONSTANTS *
*****************/

static CUjit_option LoadOpts[] =
{
    CU_JIT_MAX_REGISTERS,               // <== (unsigned int) input specifies the maximum number of registers per thread..
//  CU_JIT_THREADS_PER_BLOCK,           // <== (unsigned int) input specifies number of threads per block to target compilation for..
    CU_JIT_INFO_LOG_BUFFER,             // <== (char *) input is a pointer to a buffer in which to print any informational log messages from PTX assembly..
    CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,  // <== (unsigned int) input is the size in bytes of the buffer; output is the number of bytes filled with messages..
    CU_JIT_ERROR_LOG_BUFFER,            // <== (char *) input is a pointer to a buffer in which to print any error log messages from PTX assembly..
    CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES, // <== (unsigned int) input is the size in bytes of the buffer; output is the number of bytes filled with messages..
    CU_JIT_TARGET,                      // <== (unsigned int for enumerated type CUjit_target_enum) input is the compilation target based on supplied CUjit_target_enum..
    CU_JIT_FALLBACK_STRATEGY            // <== (unsigned int for enumerated type CUjit_fallback_enum) chooses fallback strategy if matching cubin is not found..
};

static void *LoadOptVals[ Elems( LoadOpts ) ] =
{
    (void *)32,                     // <== (unsigned int) The maximum number of registers per thread to use for this kernel..
//  (void *)1024,                   // <== (unsigned int) The number of threads per block; output returns the number of threads the compiler actually targeted..
    (void *)NULL,                   // <== (char *) Pointer to a dynamically allocated buffer which holds any informational log messages from the PTX assembly..
    (void *)PTXINFLOGSZE,           // <== (unsigned int) Size, in bytes, of the aforementioned informational log message buffer; output is the number of bytes filled with messages..
    (void *)NULL,                   // <== (char *) Pointer to a dynamically allocated buffer which holds any error log messages from the PTX assembly..
    (void *)PTXERRLOGSZE,           // <== (unsigned int) Size, in bytes, of the aforementioned error log message buffer; output is the number of bytes filled with messages..
    (void *)CU_TARGET_COMPUTE_21,   // <== (unsigned int) Designates the intended 'compilation target'..
    (void *)CU_PREFER_PTX           // <== (unsigned int) Designates that the 'fallback strategy', if no matching cubin is found, is to assume PTX..
};

// This is the actual kernel I'm trying to load:

static char MiniKern[] =
".version 2.2\n"
".target sm_20\n"
".address_size 32\n\n"
".global .u32 Frst[512];\n"
".global .u32 Scnd[512];\n"
".global .u32 USum[512];\n\n"
".entry AddVec()\n"
"{\n"
"\t.reg .u32 ndx;\n\n"
"\t.reg .u32 Fptr;\n"
"\t.reg .u32 Sptr;\n"
"\t.reg .u32 Uptr;\n\n"
"\t.reg .u32 Fval;\n"
"\t.reg .u32 Sval;\n"
"\t.reg .u32 Uval;\n\n"
"\tcvta.global.u32 Fptr, Frst;\n"
"\tcvta.global.u32 Sptr, Scnd;\n"
"\tcvta.global.u32 Uptr, USum;\n\n"
"\tmad.lo.u32 ndx, %ctaid.x, %ntid.x, %tid.x;\n"
"\tshl.b32 ndx, ndx, 2;\n\n"
"\tadd.u32 Fptr, Fptr, ndx;\n"
"\tadd.u32 Sptr, Sptr, ndx;\n"
"\tadd.u32 Uptr, Uptr, ndx;\n\n"
"\tld.global.u32 Fval, [Fptr];\n"
"\tld.global.u32 Sval, [Sptr];\n\n"
"\tadd.u32 Uval, Fval, Sval;\n"
"\tst.global.u32 [Uptr], Uval;\n\n"
"\tret.uni;\n"
"}\n";

/*****************
* LOCAL VARIABLES *
*****************/

static unsigned NumDevs;    // <== total number of NVidia GPUs in the system, usable or not..

/********************
* EXTERNAL VARIABLES *
********************/

extern char cuErrMsg[];     // <== cuInit() is called elsewhere, and on error, fills this with an ASCII error message..

/*****************************
* GLOBAL FUNCTION DEFINITIONS *
*****************************/

void InitCUDA( CDialog *Caller )
{
    CUresult Rslt;          // <== ..error return from CUDA Driver API function calls..

    CUcontext Ctx;          // <== ..the CUDA Context..
    CUmodule KMod;          // <== ..the kernel Module handle (which I never get)..
    CUdevice DevHndle;      // <== ..the CUDA device handle..

    unsigned Var1, Var2;    // <== parameters for a call to cuDeviceComputeCapability()..

#ifdef DBG_PTX
    CString DBG_MsgBxFmt;
#endif

//  FYI: GenError() is a function (defined elsewhere) that issues a longjmp(), so does NOT return..

    if ( *cuErrMsg ) GenError( Caller, ERR_NO_CUDA, cuErrMsg );     // <== cuInit() was called elsewhere, and if error, filled cuErrMsg[]..

//  CALL_DRVR_API() is a macro defined above - it's just an easy way to call into the Driver API..

    CALL_DRVR_API( cuDeviceGetCount, (*cuDeviceGetCount)( (int *)&NumDevs ) );      // <== ..works: no error..

    if ( !NumDevs ) GenError( Caller, ERR_NO_DEVS );    // <== ..works: no error..

    if ( !( LoadOptVals[ OptNdx_INFO_BUFF ] = new char [ PTXINFLOGSZE ] ) ||
         !( LoadOptVals[ OptNdx_ERR_BUFF ] = new char [ PTXERRLOGSZE ] )     ) GenError( Caller, ERR_NO_MEM );  // <== ..works: no error..

//  for simplicity, this code only attempts access to the 'first' GPU that it finds (I only have one in my system anyway)..

    CALL_DRVR_API( cuDeviceGet, (*cuDeviceGet)( &DevHndle, 0 ) );   // <== ..works: no error..
    CALL_DRVR_API( cuDeviceComputeCapability, (*cuDeviceComputeCapability)( (int *)&Var1, (int *)&Var2, DevHndle ) );   // <== ..works: no error..

    if ( Var1 < 2 ) GenError( Caller, ERR_OLD_DEV );    // <== CUDA Compute Capability must be >= 2.0 ==> works: no error..

    CALL_DRVR_API( cuCtxCreate, (*cuCtxCreate)( &Ctx, CU_CTX_SCHED_BLOCKING_SYNC | CU_CTX_MAP_HOST, DevHndle ) );   // <== ..works: no error..

#ifdef DBG_PTX
    DBG_MsgBxFmt.Format( "About to attempt compilation of the following PTX Source Code:\n\n%s", MiniKern );
    Caller->MessageBox( DBG_MsgBxFmt, "FYI");
#endif

    CALL_DRVR_API( cuModuleLoadDataEx, (*cuModuleLoadDataEx)( &KMod, (LPCTSTR)MiniKern, Elems( LoadOpts ), LoadOpts, LoadOptVals ) );

//  BOOM !! - the above call ALWAYS returns an error code of CUDA_ERROR_NO_BINARY_FOR_GPU !!

//  ...[ more code ]..

//  Cleanup:

    CALL_DRVR_API( cuModuleUnload, (*cuModuleUnload)( KMod ) );
    CALL_DRVR_API( cuCtxPopCurrent, (*cuCtxPopCurrent)( &Ctx ) );

    FREE_ARRAY_PTR( LoadOptVals[ OptNdx_INFO_BUFF ] )
    FREE_ARRAY_PTR( LoadOptVals[ OptNdx_ERR_BUFF ] )
}

This forum’s editor knocked out most of my first attempt at this reply, so I’m having to type it in again (from memory, excuse the pun)… What a pain…

Worth mentioning in the above is the fact that the ‘CU_JIT_THREADS_PER_BLOCK’ parameter to the cuModuleLoadDataEx() function is commented out. That’s because when it’s commented back in, the function call returns CUDA_ERROR_INVALID_VALUE instead of the usual CUDA_ERROR_NO_BINARY_FOR_GPU. Seeing as how a value of 1024 for ‘threads per block’ is fairly standard in CUDA, this might be some sort of clue as to what’s going on, but I can’t make heads or tails of it, other than to suggest the possibility that the cuModuleLoadDataEx() function is expecting an array of 64-bit pointers - is that even possible?

The driver I’m linking to is:

Filename: nvcuda32.dll
File description: NVIDIA CUDA Driver, Version 307.21
Product Name: NVIDIA CUDA 5.0.1 driver
File version: 8.17.13.721
Product version: 8.17.13.0721
File size: 7,697,768 bytes (7.34 MB)

which would seem to strongly imply that it’s using 32-bit pointers…
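
On the pointer-width question: the option values are passed as an array of `void *`, and integer options are stored in those slots by value, so each slot is as wide as a pointer in the calling process. A 32-bit process has 32-bit pointers even on 64-bit Windows. The sketch below uses only standard C++; `PackUInt`, `UnpackUInt` and `SlotBits` are hypothetical names introduced here to show the packing and to report the slot width of the current build.

```cpp
#include <cstdint>

// Hypothetical helpers: store a small unsigned value in a pointer-sized slot
// by value (the way the LoadOptVals[] entries above are filled), and read it back.
inline void *PackUInt(unsigned int value)
{
    return reinterpret_cast<void *>(static_cast<std::uintptr_t>(value));
}

inline unsigned int UnpackUInt(void *slot)
{
    return static_cast<unsigned int>(reinterpret_cast<std::uintptr_t>(slot));
}

// Width of one slot in the option-value array, in bits, for the current build.
constexpr unsigned SlotBits = static_cast<unsigned>(sizeof(void *) * 8);
```

Going through `std::uintptr_t` avoids compiler warnings about casting an integer of a different size to a pointer, and the same source works in both 32-bit and 64-bit builds.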

Another fairly remote possibility is that the cuModuleLoadDataEx() function is expecting ‘pinned’ memory for all of its parameters. But I’ve already tried passing in a ‘pinned’ memory kernel, and it didn’t change a thing, so…

Carl: I’m going to take a good, long, hard look at the source code you provided a link to, but I thought I’d get the ‘let’s see the source code’ part out of the way first…

allanmac: Excellent suggestion, so I tried it. Since the docs say that CU_JIT_TARGET_FROM_CUCONTEXT is the default, I simply commented out the CU_JIT_TARGET parameter (in all three places, rest assured), but still got the dreaded CUDA_ERROR_NO_BINARY_FOR_GPU…

Thanks again to all who’ve responded…

The PTX you listed in the old posting has an error:

> ptxas -m 32 -arch sm_21 add.ptx
ptxas add.ptx, line 25; error   : Special register argument not allowed for instruction 'mad.lo'
ptxas add.ptx, line 25; error   : Special register argument not allowed for instruction 'mad.lo'
ptxas add.ptx, line 25; error   : Special register argument not allowed for instruction 'mad.lo'
ptxas fatal   : Ptx assembly aborted due to errors

Line 25 being:

mad.lo.u32 ndx, %ctaid.x, %ntid.x, %tid.x;

Loading special registers into real registers will fix this:

        mov.u32 cta,  %ctaid.x;
        mov.u32 ntid, %ntid.x;
        mov.u32 tidx, %tid.x;

        mad.lo.u32 ndx, cta, ntid, tidx;

Which then compiles to:

> ptxas -v -m 32 -arch sm_21 add.ptx
ptxas info    : Compiling entry function 'AddVec' for 'sm_21'
ptxas info    : Function properties for AddVec
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 4 registers, 32 bytes cmem[0], 12 bytes cmem[14]

A colleague familiar with this API provided the following debugging tips:

  1. Check the error message coming back from cuModuleLoadDataEx. There can be useful information in there that describes what is wrong. You have to set up the error buffer manually.

  2. Take the PTX and feed it to ptxas with a similar set of arguments. Sometimes the driver compiler doesn’t give error messages as useful as those from ptxas.

It seems allanmac already applied the second suggestion above.
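
The first tip can be sketched as a small wrapper that always hands back the error text when the load does not succeed. This is an illustration only, in standard C++ with no CUDA headers: `LoadFn` is a hypothetical stand-in with the same general shape as a JIT load call (returns 0 on success, writes messages into a caller-supplied buffer), and `FakeLoad` is a toy loader that exists purely to exercise the pattern.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a JIT load call: returns 0 on success, non-zero
// on failure, and may write a message into errLog. Not a CUDA type.
typedef int (*LoadFn)(const char *ptx, char *errLog, std::size_t errLogBytes);

// Calls the loader with a zeroed error buffer and, on any non-success result,
// returns whatever text the loader left there (empty string on success).
std::string LoadAndCollectErrors(LoadFn load, const char *ptx, int *resultOut)
{
    char errLog[1024];
    std::memset(errLog, 0, sizeof(errLog));     // an untouched buffer reads as ""

    const int result = load(ptx, errLog, sizeof(errLog));
    if (resultOut) *resultOut = result;

    if (result == 0) return std::string();
    errLog[sizeof(errLog) - 1] = '\0';          // guard against a missing terminator
    return std::string(errLog);
}

// Toy loader used only to exercise the pattern: it "fails" when the text
// lacks a .version directive and says why.
int FakeLoad(const char *ptx, char *errLog, std::size_t errLogBytes)
{
    if (std::strstr(ptx, ".version") != nullptr) return 0;
    std::strncpy(errLog, "missing .version directive", errLogBytes - 1);
    return 1;                                   // arbitrary non-zero code for the toy
}
```

With the real driver call, the buffer would be the one registered through CU_JIT_ERROR_LOG_BUFFER, and the point is the same: read it whenever the return code is anything other than success, whatever that code happens to be.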

Thanks Allen (I can call you that, yeah?). So the updated code is:

#include "stdafx.h"
#include "CUDA.h"       // <== includes: #include "cuda_drvapi_dynlink_cuda.h"

extern "C" {
#include "cuda_drvapi_dynlink.c"    // <== ..to dynamically link in the 4.0 CUDA API..
}

/***************
* CONFIGURATION *
***************/

#define PTXINFLOGSZE    1024    // <== Size, in bytes, of the buffer to hold informational log messages from the PTX assembly..
#define PTXERRLOGSZE    1024    // <== Size, in bytes, of the buffer to hold any error log messages from the PTX assembly....

/**************
* ENUMERATIONS *
**************/

enum    // These are the named indices of the LoadOpts[] array, and as such, MUST exactly match the number and order of the elements of that array..
{
    OptNdx_MAX_REGS,
//  OptNdx_BLK_THREADS,
    OptNdx_INFO_BUFF,
    OptNdx_INFO_BUFF_SZE,
    OptNdx_ERR_BUFF,
    OptNdx_ERR_BUFF_SZE,
    OptNdx_TARGET,
    OptNdx_FALLBACK
};

/**************
* LOCAL MACROS *
**************/

#define Elems( array )  (sizeof(array)/sizeof(*(array)))

#define CALL_DRVR_API( func, call ) \
if ( (Rslt = (call)) != CUDA_SUCCESS ) GenError( Caller, ERR_CUDACALL, #func, Rslt )

#define FREE_ARRAY_PTR( A ) \
if ( A ) { delete [] A; A = NULL; }

/*****************
* LOCAL CONSTANTS *
*****************/

static CUjit_option LoadOpts[] =
{
    CU_JIT_MAX_REGISTERS,               // <== (unsigned int) input specifies the maximum number of registers per thread..
//  CU_JIT_THREADS_PER_BLOCK,           // <== (unsigned int) input specifies number of threads per block to target compilation for..
    CU_JIT_INFO_LOG_BUFFER,             // <== (char *) input is a pointer to a buffer in which to print any informational log messages from PTX assembly..
    CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,  // <== (unsigned int) input is the size in bytes of the buffer; output is the number of bytes filled with messages..
    CU_JIT_ERROR_LOG_BUFFER,            // <== (char *) input is a pointer to a buffer in which to print any error log messages from PTX assembly..
    CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES, // <== (unsigned int) input is the size in bytes of the buffer; output is the number of bytes filled with messages..
    CU_JIT_TARGET,                      // <== (unsigned int for enumerated type CUjit_target_enum) input is the compilation target based on supplied CUjit_target_enum..
    CU_JIT_FALLBACK_STRATEGY            // <== (unsigned int for enumerated type CUjit_fallback_enum) chooses fallback strategy if matching cubin is not found..
};

static void *LoadOptVals[ Elems( LoadOpts ) ] =
{
    (void *)32,                     // <== (unsigned int) The maximum number of registers per thread to use for this kernel..
//  (void *)1024,                   // <== (unsigned int) The number of threads per block; output returns the number of threads the compiler actually targeted..
    (void *)NULL,                   // <== (char *) Pointer to a dynamically allocated buffer which holds any informational log messages from the PTX assembly..
    (void *)PTXINFLOGSZE,           // <== (unsigned int) Size, in bytes, of the aforementioned informational log message buffer; output is the number of bytes filled with messages..
    (void *)NULL,                   // <== (char *) Pointer to a dynamically allocated buffer which holds any error log messages from the PTX assembly..
    (void *)PTXERRLOGSZE,           // <== (unsigned int) Size, in bytes, of the aforementioned error log message buffer; output is the number of bytes filled with messages..
    (void *)CU_TARGET_COMPUTE_21,   // <== (unsigned int) Designates the intended 'compilation target'..
    (void *)CU_PREFER_PTX           // <== (unsigned int) Designates that the 'fallback strategy', if no matching cubin is found, is to assume PTX..
};

// This is the actual kernel I'm trying to load:

static char MiniKern[] =
".version 2.2\n"
".target sm_20\n"
".address_size 32\n\n"
".global .u32 Frst[512];\n"
".global .u32 Scnd[512];\n"
".global .u32 USum[512];\n\n"
".entry AddVec()\n"
"{\n"
"\t.reg .u32 ndx;\n\n"
"\t.reg .u32 AM_ctaid;\n"
"\t.reg .u32 AM_ntid;\n"
"\t.reg .u32 AM_tidx;\n\n"
"\t.reg .u32 Fptr;\n"
"\t.reg .u32 Sptr;\n"
"\t.reg .u32 Uptr;\n\n"
"\t.reg .u32 Fval;\n"
"\t.reg .u32 Sval;\n"
"\t.reg .u32 Uval;\n\n"
"\tmov.u32 AM_ctaid,  %ctaid.x;\n"
"\tmov.u32 AM_ntid, %ntid.x;\n"
"\tmov.u32 AM_tidx, %tid.x;\n\n"
"\tcvta.global.u32 Fptr, Frst;\n"
"\tcvta.global.u32 Sptr, Scnd;\n"
"\tcvta.global.u32 Uptr, USum;\n\n"
"\tmad.lo.u32 ndx, AM_ctaid, AM_ntid, AM_tidx;\n"
"\tshl.b32 ndx, ndx, 2;\n\n"
"\tadd.u32 Fptr, Fptr, ndx;\n"
"\tadd.u32 Sptr, Sptr, ndx;\n"
"\tadd.u32 Uptr, Uptr, ndx;\n\n"
"\tld.global.u32 Fval, [Fptr];\n"
"\tld.global.u32 Sval, [Sptr];\n\n"
"\tadd.u32 Uval, Fval, Sval;\n"
"\tst.global.u32 [Uptr], Uval;\n\n"
"\tret.uni;\n"
"}\n";

/*****************
* LOCAL VARIABLES *
*****************/

static unsigned NumDevs;    // <== total number of NVidia GPUs in the system, usable or not..

/********************
* EXTERNAL VARIABLES *
********************/

extern char cuErrMsg[];     // <== cuInit() is called elsewhere, and on error, fills this with an ASCII error message..

/*****************************
* GLOBAL FUNCTION DEFINITIONS *
*****************************/

void InitCUDA( CDialog *Caller )
{
    CUresult Rslt;          // <== ..error return from CUDA Driver API function calls..

    CUcontext Ctx;          // <== ..the CUDA Context..
    CUmodule KMod;          // <== ..the kernel Module handle (which I never get)..
    CUdevice DevHndle;      // <== ..the CUDA device handle..

    unsigned Var1, Var2;    // <== parameters for a call to cuDeviceComputeCapability()..

#ifdef DBG_PTX
    CString DBG_MsgBxFmt;
#endif

//  FYI: GenError() is a function (defined elsewhere) that issues a longjmp(), so does NOT return..

    if ( *cuErrMsg ) GenError( Caller, ERR_NO_CUDA, cuErrMsg );     // <== cuInit() was called elsewhere, and if error, filled cuErrMsg[]..

//  CALL_DRVR_API() is a macro defined above - it's just an easy way to call into the Driver API..

    CALL_DRVR_API( cuDeviceGetCount, (*cuDeviceGetCount)( (int *)&NumDevs ) );      // <== ..works: no error..

    if ( !NumDevs ) GenError( Caller, ERR_NO_DEVS );    // <== ..works: no error..

    if ( !( LoadOptVals[ OptNdx_INFO_BUFF ] = new char [ PTXINFLOGSZE ] ) ||
         !( LoadOptVals[ OptNdx_ERR_BUFF ] = new char [ PTXERRLOGSZE ] )     ) GenError( Caller, ERR_NO_MEM );  // <== ..works: no error..

//  for simplicity, this code only attempts access to the 'first' GPU that it finds (I only have one in my system anyway)..

    CALL_DRVR_API( cuDeviceGet, (*cuDeviceGet)( &DevHndle, 0 ) );   // <== ..works: no error..
    CALL_DRVR_API( cuDeviceComputeCapability, (*cuDeviceComputeCapability)( (int *)&Var1, (int *)&Var2, DevHndle ) );   // <== ..works: no error..

    if ( Var1 < 2 ) GenError( Caller, ERR_OLD_DEV );    // <== CUDA Compute Capability must be >= 2.0 ==> works: no error..

    CALL_DRVR_API( cuCtxCreate, (*cuCtxCreate)( &Ctx, CU_CTX_SCHED_BLOCKING_SYNC | CU_CTX_MAP_HOST, DevHndle ) );   // <== ..works: no error..

#ifdef DBG_PTX
    DBG_MsgBxFmt.Format( "About to attempt compilation of the following PTX Source Code:\n\n%s", MiniKern );
    Caller->MessageBox( DBG_MsgBxFmt, "FYI");
#endif

    CALL_DRVR_API( cuModuleLoadDataEx, (*cuModuleLoadDataEx)( &KMod, (LPCTSTR)MiniKern, Elems( LoadOpts ), LoadOpts, LoadOptVals ) );

//  BOOM !! - the above call ALWAYS returns an error code of CUDA_ERROR_NO_BINARY_FOR_GPU !!

//  ...[ more code ]..

//  Cleanup:

    CALL_DRVR_API( cuModuleUnload, (*cuModuleUnload)( KMod ) );
    CALL_DRVR_API( cuCtxPopCurrent, (*cuCtxPopCurrent)( &Ctx ) );

    FREE_ARRAY_PTR( LoadOptVals[ OptNdx_INFO_BUFF ] )
    FREE_ARRAY_PTR( LoadOptVals[ OptNdx_ERR_BUFF ] )
}

You’ll probably notice that I renamed the local regs with a prefix (your initials), mostly because at this point, I don’t trust the compiler not to confuse “ntid” with “%ntid”…

So I updated my real source code with that change, re-compiled, re-executed, and…

You guessed it: CUDA_ERROR_NO_BINARY_FOR_GPU again…

Of course, that only makes sense (I think). A compilation error should have simply been written out to the buffer pointed to by LoadOptVals[ OptNdx_ERR_BUFF ], and the return value would have been… well, I don’t know - CUDA_ERROR_INVALID_SOURCE ?

njuffa: Is that what you meant by “Check the error message coming back from cuModuleLoadDataEx”? Should I be checking the contents of LoadOptVals[ OptNdx_ERR_BUFF ] after the function returns CUDA_ERROR_NO_BINARY_FOR_GPU? Up to now, I haven’t done that, because I’ve been assuming that a simple compilation error would be generating a different return value…

Does any compilation problem return CUDA_ERROR_NO_BINARY_FOR_GPU ?? If that’s the case, then, Yikes, that would strongly imply that this whole problem boils down to a simple failed compilation instead of a hardware incompatibility. That would be extremely good news…

Anyway, I’ll definitely have to look at the contents of that buffer - I’ll keep you all posted…

You still have compilation errors. You should test any PTX you write with ptxas with the appropriate target architecture.

There were two issues:

ptxas add2.ptx, line 5; error   : Feature '.address_size directive' requires PTX ISA .version 2.3 or later

and

ptxas add2.ptx, line 82; fatal   : Parsing error near ';': syntax error
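
The first of these two errors is mechanical enough to catch before the driver ever sees the text. The sketch below is a hypothetical pre-flight check in standard C++ (`AddressSizeNeedsNewerVersion` is a made-up name, and this is no substitute for running ptxas): it reports the combination ptxas rejected above, a `.address_size` directive together with a declared `.version` below 2.3. It does a naive substring search, so a directive that only appears inside a comment would also match.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical pre-flight check (not a substitute for ptxas): returns true
// when the PTX text uses .address_size but declares a .version below 2.3.
bool AddressSizeNeedsNewerVersion(const char *ptx)
{
    const char *ver = std::strstr(ptx, ".version");
    if (ver == nullptr || std::strstr(ptx, ".address_size") == nullptr)
        return false;                            // nothing to compare

    // Parse "<major>.<minor>" following the directive; strtol skips the blanks.
    char *end = nullptr;
    const long major = std::strtol(ver + std::strlen(".version"), &end, 10);
    long minor = 0;
    if (*end == '.')
        minor = std::strtol(end + 1, nullptr, 10);

    return major < 2 || (major == 2 && minor < 3);
}
```

Called on the kernel text posted in this thread, which declares `.version 2.2` and then uses `.address_size 32`, the check returns true.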

Yes, when we call cuModuleLoadDataEx and there are compilation issues, the error returned is CUDA_ERROR_NO_BINARY_FOR_GPU. In our code, if any non-success return code comes back, we print the error buffer for debugging (I believe this is LoadOptVals[ OptNdx_ERR_BUFF ] in your code).

I believe what goes on under the hood (merely guessing at this point) is that CUBINs can contain code for many different architectures (say SASS 1.0, SASS 2.0, PTX 1.0, PTX 2.0). It will search the available code and try to find one that best matches the GPU. This could be SASS device code or PTX. In this case there is only one option, the PTX you gave it. Probably after trying to compile the PTX and failing the number of candidates for running device code is zero and you get CUDA_ERROR_NO_BINARY_FOR_GPU.
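
As a toy model of that guess (and only that: none of these names come from the CUDA API, and the real driver logic is not documented in this thread), the selection can be pictured as a search over candidate images in which PTX that fails to compile simply drops out. When nothing usable is left, the caller learns only that no image exists, not why.

```cpp
#include <vector>

// Toy model of the guess above; none of these names come from the CUDA API.
struct Candidate {
    int  smMajor;      // architecture the image targets
    bool isPtx;        // true for PTX text, false for precompiled device code
    bool compiles;     // for PTX: whether JIT compilation would succeed
};

enum LoadResult { LOAD_OK, LOAD_NO_BINARY_FOR_GPU };

// An image is usable when it is precompiled code for exactly this device, or
// PTX that targets an architecture no newer than the device and compiles.
// If no image is usable, the only thing reported is that none was found.
LoadResult PickImage(const std::vector<Candidate> &images, int deviceSmMajor)
{
    for (const Candidate &c : images) {
        if (!c.isPtx && c.smMajor == deviceSmMajor) return LOAD_OK;
        if (c.isPtx && c.smMajor <= deviceSmMajor && c.compiles) return LOAD_OK;
    }
    return LOAD_NO_BINARY_FOR_GPU;
}
```

In this model a single PTX candidate with a syntax error produces exactly the same result as having no image for the device at all, which matches the behaviour described in this thread.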

Hurrah!! Hurrah!! Gentlemen,

!! THIS PROBLEM IS FINALLY SOLVED !!

All this time, I was thinking that the CUDA_ERROR_NO_BINARY_FOR_GPU error was a hardware-related problem, because the docs specifically state:

“This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration.”

which, as it turns out, is bizzaranese for “Your kernel has one or more compiler errors”…

This I’ve confirmed by taking the sage advice of one “njuffa” (who, not incidentally, has helped me once before with a better understanding of an “oversubscription factor”) and checking the contents of my LoadOptVals[ OptNdx_ERR_BUFF ] buffer after getting CUDA_ERROR_NO_BINARY_FOR_GPU, and

!! Lo and behold !! (excuse my excitement)

I’m now getting meaningful PTX compiler errors!! Yay!

Thank you, thank you one and all…

…and now we return to our regularly scheduled programming…

Glad to hear it all worked out. I can’t take any credit here, I was merely forwarding advice from a person who has hands-on experience with this particular API.