Help!! I can't get my NVidia GeForce GT 525M to load in a single CUDA PTX kernel!!

Billy_Rubin · November 15, 2012, 2:41am

Try as I might, I just can’t get my NVidia GeForce GT 525M to load in a single CUDA PTX kernel using the CUDA Driver API!!

How and where can I get professional support for this problem?

Every time my 32-bit Windows program attempts to load an ASCII, zero-terminated PTX kernel into the GPU, using the CUDA Driver API function, cuModuleLoadDataEx(), it always returns CUDA_ERROR_NO_BINARY_FOR_GPU.

It’s not a linkage problem, because previous calls to the CUDA Driver API functions, cuInit(), cuDeviceGetCount(), cuDeviceGet(), cuDeviceComputeCapability(), cuDeviceGetAttribute(), and cuCtxCreate() all work great (they return no error).

I originally thought that I must be doing something wrong, but I’ve checked my program code, re-checked it, then re-checked it again, and again, and again, and… well, you get the idea. This problem is nine months old, and I’ve tried every permutation I can think of.

The parameters I’m using (as of this writing) to load in the kernel are:

CU_JIT_MAX_REGISTERS = 32
CU_JIT_INFO_LOG_BUFFER = new char [1024]
CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES = 1024
CU_JIT_ERROR_LOG_BUFFER = new char [1024]
CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES = 1024
CU_JIT_TARGET = CU_TARGET_COMPUTE_21
CU_JIT_FALLBACK_STRATEGY = CU_PREFER_PTX

Nine months ago, I originally posted the problem here:

CUDA_ERROR_NO_BINARY_FOR_GPU loading PTX ? Can't load PTX 'image' no matter what I do.. - CUDA Programming and Performance - NVIDIA Developer Forums

and got one reply, but no answers.

I really need an answer to this, because being able to finish and test the software on this relatively tiny CUDA-based GPU is basically the sole determinant for the feasibility of buying what will probably be a very expensive system, replete with multiple high-end Tesla cards. In short, how can I shell out $12,000 or more for that kind of hardware, when I don’t even have the software to use it?

So is there any billable support mechanism for NVidia CUDA programming, where I can get a definitive answer to this dilemma?

If it helps, here are more specifics on this seemingly insurmountable problem:

Machine: Dell Inspiron N7110
Chipset: GeForce GT 525M, using NVidia Optimus (Is Optimus the problem?)
OS: Windows 7 64-bit Professional, latest build

Of course, I can provide lots more specifics on the problem, but I think it best not to get ahead of myself here…

Can anyone offer any usable advice as to what may be wrong, or who/where I can pay for an answer?

Thanks in advance.

Mark_Harris · November 15, 2012, 6:59am

I have seen this error before when I passed invalid PTX code to cuModuleLoadDataEx. But it also looks like you may be specifying the options wrong.

If you post the actual API code you are running we might be able to help more.

Carl_Ponder · November 15, 2012, 12:15pm

The LAMMPS/GPU package works this way; the source code is public and you can download it from

http://lammps.sandia.gov/tars/lammps-14Oct12.tar.gz

for example. I think the main file you would need to look at is

LAMMPS/lib/gpu/geryon/nvd_kernel.h

You may be able to distil all the details for PTX load from this, I’m not sure. Since you’re running on Windows you wouldn’t actually be able to build LAMMPS to be sure that the steps are sufficient in your environment.

allanmac · November 15, 2012, 7:33pm

If you’re handing in a PTX file with a compatible .target then why not use CU_JIT_TARGET_FROM_CUCONTEXT instead of explicitly stating a compute target?

Also, is your PTX hand-written? Are you sure it’s ASCII?

For what it’s worth, I’ve had good results across pretty much all architectures and drivers loading cubins in fatbins via cuModuleLoadFatBinary(). Unfortunately, I never tried embedding PTX’s. :|
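On the ASCII question, a minimal sketch of a pre-flight check in plain C++ is shown here. It assumes the PTX image sits in a char buffer whose size includes the terminating zero; LooksLikeAsciiPtx is a hypothetical helper name, not part of the CUDA Driver API.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if buf[0..len) is printable 7-bit ASCII (plus tab, CR, LF)
// and its last byte is the terminating '\0'.
bool LooksLikeAsciiPtx(const char* buf, std::size_t len)
{
    assert(buf != nullptr);
    if (len == 0 || buf[len - 1] != '\0') return false;   // must be zero-terminated
    for (std::size_t i = 0; i + 1 < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(buf[i]);
        const bool whitespace = (c == '\t' || c == '\n' || c == '\r');
        const bool printable  = (c >= 0x20 && c <= 0x7E);
        if (!whitespace && !printable) return false;       // embedded NUL, UTF-16 byte, BOM, ...
    }
    return true;
}
```

For a kernel held in a static char array, a call such as LooksLikeAsciiPtx( Kern, sizeof(Kern) ) would catch a wide-character or BOM-prefixed image before it ever reaches the driver.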

Billy_Rubin · November 15, 2012, 7:47pm

Many thanks to all three of you for your replies.

Mark: Well, I knew that if anyone replied (and that was far from certain) that at least one of them would ask to see the source code (it’s what I would do). So, because the source code itself includes a whole mess of stuff irrelevant to the problem, I created the following, which is essentially what the program does, without the other, entirely superfluous code:

#include "stdafx.h"
#include "CUDA.h"       // <== includes: #include "cuda_drvapi_dynlink_cuda.h"

extern "C" {
#include "cuda_drvapi_dynlink.c"    // <== ..to dynamically link in the 4.0 CUDA API..
}

/***************
* CONFIGURATION *
***************/

#define PTXINFLOGSZE    1024    // <== Size, in bytes, of the buffer to hold informational log messages from the PTX assembly..
#define PTXERRLOGSZE    1024    // <== Size, in bytes, of the buffer to hold any error log messages from the PTX assembly....

/**************
* ENUMERATIONS *
**************/

enum    // These are the named indices of the LoadOpts[] array, and as such, MUST exactly match the number and order of the elements of that array..
{
    OptNdx_MAX_REGS,
//  OptNdx_BLK_THREADS,
    OptNdx_INFO_BUFF,
    OptNdx_INFO_BUFF_SZE,
    OptNdx_ERR_BUFF,
    OptNdx_ERR_BUFF_SZE,
    OptNdx_TARGET,
    OptNdx_FALLBACK
};

/**************
* LOCAL MACROS *
**************/

#define Elems( array )  (sizeof(array)/sizeof(*(array)))

#define CALL_DRVR_API( func, call ) \
if ( (Rslt = (call)) != CUDA_SUCCESS ) GenError( Caller, ERR_CUDACALL, #func, Rslt )

#define FREE_ARRAY_PTR( A ) \
if ( A ) { delete [] A; A = NULL; }

/*****************
* LOCAL CONSTANTS *
*****************/

static CUjit_option LoadOpts[] =
{
    CU_JIT_MAX_REGISTERS,               // <== (unsigned int) input specifies the maximum number of registers per thread..
//  CU_JIT_THREADS_PER_BLOCK,           // <== (unsigned int) input specifies number of threads per block to target compilation for..
    CU_JIT_INFO_LOG_BUFFER,             // <== (char *) input is a pointer to a buffer in which to print any informational log messages from PTX assembly..
    CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,  // <== (unsigned int) input is the size in bytes of the buffer; output is the number of bytes filled with messages..
    CU_JIT_ERROR_LOG_BUFFER,            // <== (char *) input is a pointer to a buffer in which to print any error log messages from PTX assembly..
    CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES, // <== (unsigned int) input is the size in bytes of the buffer; output is the number of bytes filled with messages..
    CU_JIT_TARGET,                      // <== (unsigned int for enumerated type CUjit_target_enum) input is the compilation target based on supplied CUjit_target_enum..
    CU_JIT_FALLBACK_STRATEGY            // <== (unsigned int for enumerated type CUjit_fallback_enum) chooses fallback strategy if matching cubin is not found..
};

static void *LoadOptVals[ Elems( LoadOpts ) ] =
{
    (void *)32,                     // <== (unsigned int) The maximum number of registers per thread to use for this kernel..
//  (void *)1024,                   // <== (unsigned int) The number of threads per block; output returns the number of threads the compiler actually targeted..
    (void *)NULL,                   // <== (char *) Pointer to a dynamically allocated buffer which holds any informational log messages from the PTX assembly..
    (void *)PTXINFLOGSZE,           // <== (unsigned int) Size, in bytes, of the aforementioned informational log message buffer; output is the number of bytes filled with messages..
    (void *)NULL,                   // <== (char *) Pointer to a dynamically allocated buffer which holds any error log messages from the PTX assembly..
    (void *)PTXERRLOGSZE,           // <== (unsigned int) Size, in bytes, of the aforementioned error log message buffer; output is the number of bytes filled with messages..
    (void *)CU_TARGET_COMPUTE_21,   // <== (unsigned int) Designates the intended 'compilation target'..
    (void *)CU_PREFER_PTX           // <== (unsigned int) Designates that the 'fallback strategy', if no matching cubin is found, is to assume PTX..
};

// This is the actual kernel I'm trying to load:

static char MiniKern[] =
".version 2.2\n"
".target sm_20\n"
".address_size 32\n\n"
".global .u32 Frst[512];\n"
".global .u32 Scnd[512];\n"
".global .u32 USum[512];\n\n"
".entry AddVec()\n"
"{\n"
"\t.reg .u32 ndx;\n\n"
"\t.reg .u32 Fptr;\n"
"\t.reg .u32 Sptr;\n"
"\t.reg .u32 Uptr;\n\n"
"\t.reg .u32 Fval;\n"
"\t.reg .u32 Sval;\n"
"\t.reg .u32 Uval;\n\n"
"\tcvta.global.u32 Fptr, Frst;\n"
"\tcvta.global.u32 Sptr, Scnd;\n"
"\tcvta.global.u32 Uptr, USum;\n\n"
"\tmad.lo.u32 ndx, %ctaid.x, %ntid.x, %tid.x;\n"
"\tshl.b32 ndx, ndx, 2;\n\n"
"\tadd.u32 Fptr, Fptr, ndx;\n"
"\tadd.u32 Sptr, Sptr, ndx;\n"
"\tadd.u32 Uptr, Uptr, ndx;\n\n"
"\tld.global.u32 Fval, [Fptr];\n"
"\tld.global.u32 Sval, [Sptr];\n\n"
"\tadd.u32 Uval, Fval, Sval;\n"
"\tst.global.u32 [Uptr], Uval;\n\n"
"\tret.uni;\n"
"}\n";

/*****************
* LOCAL VARIABLES *
*****************/

static unsigned NumDevs;    // <== total number of NVidia GPUs in the system, usable or not..

/********************
* EXTERNAL VARIABLES *
********************/

extern char cuErrMsg[];     // <== cuInit() is called elsewhere, and on error, fills this with an ASCII error message..

/*****************************
* GLOBAL FUNCTION DEFINITIONS *
*****************************/

void InitCUDA( CDialog *Caller )
{
    CUresult Rslt;          // <== ..error return from CUDA Driver API function calls..

    CUcontext Ctx;          // <== ..the CUDA Context..
    CUmodule KMod;          // <== ..the kernel Module handle (which I never get)..
    CUdevice DevHndle;      // <== ..the CUDA device handle..

    unsigned Var1, Var2;    // <== parameters for a call to cuDeviceComputeCapability()..

#ifdef DBG_PTX
    CString DBG_MsgBxFmt;
#endif

//  FYI: GenError() is a function (defined elsewhere) that issues a longjmp(), so does NOT return..

    if ( *cuErrMsg ) GenError( Caller, ERR_NO_CUDA, cuErrMsg );     // <== cuInit() was called elsewhere, and if error, filled cuErrMsg[]..

//  CALL_DRVR_API() is a macro defined above - it's just an easy way to call into the Driver API..

    CALL_DRVR_API( cuDeviceGetCount, (*cuDeviceGetCount)( (int *)&NumDevs ) );      // <== ..works: no error..

    if ( !NumDevs ) GenError( Caller, ERR_NO_DEVS );    // <== ..works: no error..

    if ( !( LoadOptVals[ OptNdx_INFO_BUFF ] = new char [ PTXINFLOGSZE ] ) ||
         !( LoadOptVals[ OptNdx_ERR_BUFF ] = new char [ PTXERRLOGSZE ] )     ) GenError( Caller, ERR_NO_MEM );  // <== ..works: no error..

//  for simplicity, this code only attempts access to the 'first' GPU that it finds (I only have one in my system anyway)..

    CALL_DRVR_API( cuDeviceGet, (*cuDeviceGet)( &DevHndle, 0 ) );   // <== ..works: no error..
    CALL_DRVR_API( cuDeviceComputeCapability, (*cuDeviceComputeCapability)( (int *)&Var1, (int *)&Var2, DevHndle ) );   // <== ..works: no error..

    if ( Var1 < 2 ) GenError( Caller, ERR_OLD_DEV );    // <== CUDA Compute Capability must be >= 2.0 ==> works: no error..

    CALL_DRVR_API( cuCtxCreate, (*cuCtxCreate)( &Ctx, CU_CTX_SCHED_BLOCKING_SYNC | CU_CTX_MAP_HOST, DevHndle ) );   // <== ..works: no error..

#ifdef DBG_PTX
    DBG_MsgBxFmt.Format( "About to attempt compilation of the following PTX Source Code:\n\n%s", MiniKern );
    Caller->MessageBox( DBG_MsgBxFmt, "FYI");
#endif

    CALL_DRVR_API( cuModuleLoadDataEx, (*cuModuleLoadDataEx)( &KMod, (LPCTSTR)MiniKern, Elems( LoadOpts ), LoadOpts, LoadOptVals ) );

//  BOOM !! - the above call ALWAYS returns an error code of CUDA_ERROR_NO_BINARY_FOR_GPU !!

//  ...[ more code ]..

//  Cleanup:

    CALL_DRVR_API( cuModuleUnload, (*cuModuleUnload)( KMod ) );
    CALL_DRVR_API( cuCtxPopCurrent, (*cuCtxPopCurrent)( &Ctx ) );

    FREE_ARRAY_PTR( LoadOptVals[ OptNdx_INFO_BUFF ] )
    FREE_ARRAY_PTR( LoadOptVals[ OptNdx_ERR_BUFF ] )
}

This forum’s editor knocked out most of my first attempt at this reply, so I’m having to type it in again (from memory, excuse the pun)… What a pain…

Worth mentioning in the above is the fact that the ‘CU_JIT_THREADS_PER_BLOCK’ parameter to the cuModuleLoadDataEx() function is commented out. That’s because when it’s commented back in, the function call returns CUDA_ERROR_INVALID_VALUE instead of the usual CUDA_ERROR_NO_BINARY_FOR_GPU. Seeing as how a value of 1024 for ‘threads per block’ is fairly standard in CUDA, this might be some sort of clue as to what’s going on, but I can’t make heads or tails of it, other than to suggest the possibility that the cuModuleLoadDataEx() function is expecting an array of 64-bit pointers - is that even possible?

The driver I’m linking to is:

Filename: nvcuda32.dll
File description: NVIDIA CUDA Driver, Version 307.21
Product Name: NVIDIA CUDA 5.0.1 driver
File version: 8.17.13.721
Product version: 8.17.13.0721
File size: 7,697,768 bytes (7.34 MB)

which would seem to strongly imply that it’s using 32-bit pointers…
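For what it’s worth, the option-value array passed to cuModuleLoadDataEx() is an array of void*, so each slot is as wide as a pointer in the calling process, and small integer options travel inside the pointer value itself, as the (void *)32 entries in the listing do. A minimal sketch of that packing in plain C++ follows. PackOptionValue, UnpackOptionValue and OptionSlotBytes are hypothetical helper names, not part of the CUDA Driver API, and nothing here explains the CUDA_ERROR_INVALID_VALUE result.

```cpp
#include <cassert>
#include <cstdint>

// Store a small unsigned option value directly in a void* slot.
inline void* PackOptionValue(unsigned int value)
{
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(value));
}

// Read a small unsigned value back out of a void* slot.
inline unsigned int UnpackOptionValue(void* slot)
{
    const std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(slot);
    assert(raw <= UINT32_MAX);   // fires if the slot really holds a wide pointer
    return static_cast<unsigned int>(raw);
}

// Width of one slot in bytes for the current build:
// 4 in a 32-bit process, 8 in a 64-bit one.
constexpr unsigned int OptionSlotBytes()
{
    return static_cast<unsigned int>(sizeof(void*));
}
```

A 32-bit process cannot load a 64-bit DLL, so the caller and the driver library necessarily agree on the slot width.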

Another fairly remote possibility is that the cuModuleLoadDataEx() function is expecting ‘pinned’ memory for all of its parameters. But I’ve already tried passing in a ‘pinned’ memory kernel, and it didn’t change a thing, so…

Carl: I’m going to take a good, long, hard look at the source code you provided a link to, but I thought I’d get the ‘let’s see the source code’ part out of the way first…

allanmac: Excellent suggestion, so I tried it. Since the docs say that CU_JIT_TARGET_FROM_CUCONTEXT is the default, I simply commented out the CU_JIT_TARGET parameter (in all three places, rest assured), but still got the dreaded CUDA_ERROR_NO_BINARY_FOR_GPU…

Thanks again to all who’ve responded…

allanmac · November 15, 2012, 7:47pm

The PTX you listed in the old posting has an error:

> ptxas -m 32 -arch sm_21 add.ptx
ptxas add.ptx, line 25; error   : Special register argument not allowed for instruction 'mad.lo'
ptxas add.ptx, line 25; error   : Special register argument not allowed for instruction 'mad.lo'
ptxas add.ptx, line 25; error   : Special register argument not allowed for instruction 'mad.lo'
ptxas fatal   : Ptx assembly aborted due to errors

Line 25 being:

mad.lo.u32 ndx, %ctaid.x, %ntid.x, %tid.x;

Loading special registers into real registers will fix this:

	mov.u32 cta,  %ctaid.x;
	mov.u32 ntid, %ntid.x;
	mov.u32 tidx, %tid.x;

	mad.lo.u32 ndx, cta, ntid, tidx;

Which then compiles to:

> ptxas -v -m 32 -arch sm_21 add.ptx
ptxas info    : Compiling entry function 'AddVec' for 'sm_21'
ptxas info    : Function properties for AddVec
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 4 registers, 32 bytes cmem[0], 12 bytes cmem[14]

njuffa · November 15, 2012, 8:53pm

A colleague familiar with this API provided the following debugging tips:

Check the error message coming back from cuModuleLoadDataEx. There can be useful information in there that describes what is wrong. You have to set up the error buffer manually.
Take the PTX and feed it to ptxas with a similar set of arguments. Sometimes the driver compiler doesn’t give error messages as useful as those from ptxas.

It seems allanmac already applied the second suggestion above.
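The first tip can be sketched roughly as follows. To keep the block compilable without the CUDA headers, the driver call is hidden behind a hypothetical Loader callable; in real code that callable would wrap cuModuleLoadDataEx() with CU_JIT_ERROR_LOG_BUFFER and CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES pointing at the buffer it is handed. LoadReport and LoadAndCaptureLog are likewise made-up names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the real driver call. It receives the error-log
// buffer and its capacity and returns 0 on success, non-zero on failure.
using Loader = std::function<int(char* errLog, std::size_t errLogBytes)>;

struct LoadReport {
    int         status;  // raw status code returned by the loader
    std::string errLog;  // whatever text the loader left in the error buffer
};

// Runs the loader with a zero-filled error buffer and always returns the log
// text along with the status, so the caller can print it on any failure.
LoadReport LoadAndCaptureLog(const Loader& loader, std::size_t errLogBytes = 1024)
{
    assert(errLogBytes > 0);
    std::vector<char> errLog(errLogBytes, '\0');
    const int status = loader(errLog.data(), errLog.size());
    errLog.back() = '\0';  // guarantee termination even if the buffer was filled
    return LoadReport{ status, std::string(errLog.data()) };
}
```

The point of the design is that the log is read on every non-zero status, whatever the specific error code happens to be.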

Billy_Rubin · November 15, 2012, 9:34pm

Thanks Allen (I can call you that, yeah?). So the updated code is:

#include "stdafx.h"
#include "CUDA.h"       // <== includes: #include "cuda_drvapi_dynlink_cuda.h"

extern "C" {
#include "cuda_drvapi_dynlink.c"    // <== ..to dynamically link in the 4.0 CUDA API..
}

/***************
* CONFIGURATION *
***************/

#define PTXINFLOGSZE    1024    // <== Size, in bytes, of the buffer to hold informational log messages from the PTX assembly..
#define PTXERRLOGSZE    1024    // <== Size, in bytes, of the buffer to hold any error log messages from the PTX assembly....

/**************
* ENUMERATIONS *
**************/

enum    // These are the named indices of the LoadOpts[] array, and as such, MUST exactly match the number and order of the elements of that array..
{
    OptNdx_MAX_REGS,
//  OptNdx_BLK_THREADS,
    OptNdx_INFO_BUFF,
    OptNdx_INFO_BUFF_SZE,
    OptNdx_ERR_BUFF,
    OptNdx_ERR_BUFF_SZE,
    OptNdx_TARGET,
    OptNdx_FALLBACK
};

/**************
* LOCAL MACROS *
**************/

#define Elems( array )  (sizeof(array)/sizeof(*(array)))

#define CALL_DRVR_API( func, call ) \
if ( (Rslt = (call)) != CUDA_SUCCESS ) GenError( Caller, ERR_CUDACALL, #func, Rslt )

#define FREE_ARRAY_PTR( A ) \
if ( A ) { delete [] A; A = NULL; }

/*****************
* LOCAL CONSTANTS *
*****************/

static CUjit_option LoadOpts[] =
{
    CU_JIT_MAX_REGISTERS,               // <== (unsigned int) input specifies the maximum number of registers per thread..
//  CU_JIT_THREADS_PER_BLOCK,           // <== (unsigned int) input specifies number of threads per block to target compilation for..
    CU_JIT_INFO_LOG_BUFFER,             // <== (char *) input is a pointer to a buffer in which to print any informational log messages from PTX assembly..
    CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,  // <== (unsigned int) input is the size in bytes of the buffer; output is the number of bytes filled with messages..
    CU_JIT_ERROR_LOG_BUFFER,            // <== (char *) input is a pointer to a buffer in which to print any error log messages from PTX assembly..
    CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES, // <== (unsigned int) input is the size in bytes of the buffer; output is the number of bytes filled with messages..
    CU_JIT_TARGET,                      // <== (unsigned int for enumerated type CUjit_target_enum) input is the compilation target based on supplied CUjit_target_enum..
    CU_JIT_FALLBACK_STRATEGY            // <== (unsigned int for enumerated type CUjit_fallback_enum) chooses fallback strategy if matching cubin is not found..
};

static void *LoadOptVals[ Elems( LoadOpts ) ] =
{
    (void *)32,                     // <== (unsigned int) The maximum number of registers per thread to use for this kernel..
//  (void *)1024,                   // <== (unsigned int) The number of threads per block; output returns the number of threads the compiler actually targeted..
    (void *)NULL,                   // <== (char *) Pointer to a dynamically allocated buffer which holds any informational log messages from the PTX assembly..
    (void *)PTXINFLOGSZE,           // <== (unsigned int) Size, in bytes, of the aforementioned informational log message buffer; output is the number of bytes filled with messages..
    (void *)NULL,                   // <== (char *) Pointer to a dynamically allocated buffer which holds any error log messages from the PTX assembly..
    (void *)PTXERRLOGSZE,           // <== (unsigned int) Size, in bytes, of the aforementioned error log message buffer; output is the number of bytes filled with messages..
    (void *)CU_TARGET_COMPUTE_21,   // <== (unsigned int) Designates the intended 'compilation target'..
    (void *)CU_PREFER_PTX           // <== (unsigned int) Designates that the 'fallback strategy', if no matching cubin is found, is to assume PTX..
};

// This is the actual kernel I'm trying to load:

static char MiniKern[] =
".version 2.2\n"
".target sm_20\n"
".address_size 32\n\n"
".global .u32 Frst[512];\n"
".global .u32 Scnd[512];\n"
".global .u32 USum[512];\n\n"
".entry AddVec()\n"
"{\n"
"\t.reg .u32 ndx;\n\n"
"\t.reg .u32 AM_ctaid;\n"
"\t.reg .u32 AM_ntid;\n"
"\t.reg .u32 AM_tidx;\n\n"
"\t.reg .u32 Fptr;\n"
"\t.reg .u32 Sptr;\n"
"\t.reg .u32 Uptr;\n\n"
"\t.reg .u32 Fval;\n"
"\t.reg .u32 Sval;\n"
"\t.reg .u32 Uval;\n\n"
"\tmov.u32 AM_ctaid,  %ctaid.x;\n"
"\tmov.u32 AM_ntid, %ntid.x;\n"
"\tmov.u32 AM_tidx, %tid.x;\n\n"
"\tcvta.global.u32 Fptr, Frst;\n"
"\tcvta.global.u32 Sptr, Scnd;\n"
"\tcvta.global.u32 Uptr, USum;\n\n"
"\tmad.lo.u32 ndx, AM_ctaid, AM_ntid, AM_tidx;\n"
"\tshl.b32 ndx, ndx, 2;\n\n"
"\tadd.u32 Fptr, Fptr, ndx;\n"
"\tadd.u32 Sptr, Sptr, ndx;\n"
"\tadd.u32 Uptr, Uptr, ndx;\n\n"
"\tld.global.u32 Fval, [Fptr];\n"
"\tld.global.u32 Sval, [Sptr];\n\n"
"\tadd.u32 Uval, Fval, Sval;\n"
"\tst.global.u32 [Uptr], Uval;\n\n"
"\tret.uni;\n"
"}\n";

/*****************
* LOCAL VARIABLES *
*****************/

static unsigned NumDevs;    // <== total number of NVidia GPUs in the system, usable or not..

/********************
* EXTERNAL VARIABLES *
********************/

extern char cuErrMsg[];     // <== cuInit() is called elsewhere, and on error, fills this with an ASCII error message..

/*****************************
* GLOBAL FUNCTION DEFINITIONS *
*****************************/

void InitCUDA( CDialog *Caller )
{
    CUresult Rslt;          // <== ..error return from CUDA Driver API function calls..

    CUcontext Ctx;          // <== ..the CUDA Context..
    CUmodule KMod;          // <== ..the kernel Module handle (which I never get)..
    CUdevice DevHndle;      // <== ..the CUDA device handle..

    unsigned Var1, Var2;    // <== parameters for a call to cuDeviceComputeCapability()..

#ifdef DBG_PTX
    CString DBG_MsgBxFmt;
#endif

//  FYI: GenError() is a function (defined elsewhere) that issues a longjmp(), so does NOT return..

    if ( *cuErrMsg ) GenError( Caller, ERR_NO_CUDA, cuErrMsg );     // <== cuInit() was called elsewhere, and if error, filled cuErrMsg[]..

//  CALL_DRVR_API() is a macro defined above - it's just an easy way to call into the Driver API..

    CALL_DRVR_API( cuDeviceGetCount, (*cuDeviceGetCount)( (int *)&NumDevs ) );      // <== ..works: no error..

    if ( !NumDevs ) GenError( Caller, ERR_NO_DEVS );    // <== ..works: no error..

    if ( !( LoadOptVals[ OptNdx_INFO_BUFF ] = new char [ PTXINFLOGSZE ] ) ||
         !( LoadOptVals[ OptNdx_ERR_BUFF ] = new char [ PTXERRLOGSZE ] )     ) GenError( Caller, ERR_NO_MEM );  // <== ..works: no error..

//  for simplicity, this code only attempts access to the 'first' GPU that it finds (I only have one in my system anyway)..

    CALL_DRVR_API( cuDeviceGet, (*cuDeviceGet)( &DevHndle, 0 ) );   // <== ..works: no error..
    CALL_DRVR_API( cuDeviceComputeCapability, (*cuDeviceComputeCapability)( (int *)&Var1, (int *)&Var2, DevHndle ) );   // <== ..works: no error..

    if ( Var1 < 2 ) GenError( Caller, ERR_OLD_DEV );    // <== CUDA Compute Capability must be >= 2.0 ==> works: no error..

    CALL_DRVR_API( cuCtxCreate, (*cuCtxCreate)( &Ctx, CU_CTX_SCHED_BLOCKING_SYNC | CU_CTX_MAP_HOST, DevHndle ) );   // <== ..works: no error..

#ifdef DBG_PTX
    DBG_MsgBxFmt.Format( "About to attempt compilation of the following PTX Source Code:\n\n%s", MiniKern );
    Caller->MessageBox( DBG_MsgBxFmt, "FYI");
#endif

    CALL_DRVR_API( cuModuleLoadDataEx, (*cuModuleLoadDataEx)( &KMod, (LPCTSTR)MiniKern, Elems( LoadOpts ), LoadOpts, LoadOptVals ) );

//  BOOM !! - the above call ALWAYS returns an error code of CUDA_ERROR_NO_BINARY_FOR_GPU !!

//  ...[ more code ]..

//  Cleanup:

    CALL_DRVR_API( cuModuleUnload, (*cuModuleUnload)( KMod ) );
    CALL_DRVR_API( cuCtxPopCurrent, (*cuCtxPopCurrent)( &Ctx ) );

    FREE_ARRAY_PTR( LoadOptVals[ OptNdx_INFO_BUFF ] )
    FREE_ARRAY_PTR( LoadOptVals[ OptNdx_ERR_BUFF ] )
}

You’ll probably notice that I renamed the local regs with a prefix (your initials), mostly because at this point, I don’t trust the compiler not to confuse “ntid” with “%ntid”…

So I updated my real source code with that change, re-compiled, re-executed, and…

You guessed it: CUDA_ERROR_NO_BINARY_FOR_GPU again…

Of course, that only makes sense (I think). A compilation error should have simply been written out to the buffer pointed to by LoadOptVals[ OptNdx_ERR_BUFF ], and the return value would have been… well, I don’t know - CUDA_ERROR_INVALID_SOURCE ?

njuffa: Is that what you meant by “Check the error message coming back from cuModuleLoadDataEx”? Should I be checking the contents of LoadOptVals[ OptNdx_ERR_BUFF ] after the function returns CUDA_ERROR_NO_BINARY_FOR_GPU? Up to now, I haven’t done that, because I’ve been assuming that a simple compilation error would be generating a different return value…

Does any compilation problem return CUDA_ERROR_NO_BINARY_FOR_GPU ?? If that’s the case, then, Yikes, that would strongly imply that this whole problem boils down to a simple failed compilation instead of a hardware incompatibility. That would be extremely good news…

Anyway, I’ll definitely have to look at the contents of that buffer - I’ll keep you all posted…

allanmac · November 15, 2012, 9:47pm

You still have compilation errors. You should test any PTX you write with ptxas with the appropriate target architecture.

There were two issues:

ptxas add2.ptx, line 5; error   : Feature '.address_size directive' requires PTX ISA .version 2.3 or later

and

ptxas add2.ptx, line 82; fatal   : Parsing error near ';': syntax error
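The first of these two errors is the kind of thing a tiny host-side lint could flag before the driver is ever called. Below is a minimal sketch in plain C++ that relies only on the rule quoted in the ptxas message (.address_size needs .version 2.3 or later). ParsePtxVersion and AddressSizeNeedsNewerVersion are hypothetical helper names, and this is no substitute for running ptxas itself.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Extracts major/minor from the first ".version M.N" directive; false if absent.
bool ParsePtxVersion(const std::string& ptx, int& major, int& minor)
{
    const std::string key = ".version";
    const std::size_t pos = ptx.find(key);
    if (pos == std::string::npos) return false;
    return std::sscanf(ptx.c_str() + pos + key.size(), "%d.%d", &major, &minor) == 2;
}

// True if the text uses .address_size but declares a PTX ISA version below 2.3.
bool AddressSizeNeedsNewerVersion(const std::string& ptx)
{
    if (ptx.find(".address_size") == std::string::npos) return false;
    int major = 0, minor = 0;
    if (!ParsePtxVersion(ptx, major, minor)) return true;  // no usable .version at all
    return major < 2 || (major == 2 && minor < 3);
}

// Quick self-check using the header directives from the posted kernel.
inline void SelfCheck()
{
    assert(AddressSizeNeedsNewerVersion(".version 2.2\n.target sm_20\n.address_size 32\n"));
    assert(!AddressSizeNeedsNewerVersion(".version 2.3\n.target sm_20\n.address_size 32\n"));
}
```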

JBigler · November 15, 2012, 10:37pm

Yes, when we call cuModuleLoadDataEx and there are compilation issues, the error returned is CUDA_ERROR_NO_BINARY_FOR_GPU. In our code, if any non-success return code comes back, we spit out the error buffer for debugging (I believe this is LoadOptVals[ OptNdx_ERR_BUFF ] in your code).

I believe what goes on under the hood (merely guessing at this point) is that CUBINs can contain code for many different architectures (say SASS 1.0, SASS 2.0, PTX 1.0, PTX 2.0). It will search the available code and try to find one that best matches the GPU. This could be SASS device code or PTX. In this case there is only one option, the PTX you gave it. Probably after trying to compile the PTX and failing the number of candidates for running device code is zero and you get CUDA_ERROR_NO_BINARY_FOR_GPU.
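That guess can be written down as a toy model in plain C++. To be clear, this only illustrates the reasoning in the paragraph above and is not the driver's actual logic; Candidate, LoadStatus and PickCandidate are invented for the illustration.

```cpp
#include <cassert>
#include <vector>

enum class LoadStatus { Success, NoBinaryForGpu };

// One candidate image inside a module: either machine code for a specific
// architecture, or PTX that still has to be JIT-compiled.
struct Candidate {
    bool isPtx;        // true: PTX text, false: precompiled machine code
    int  arch;         // e.g. 21 for compute capability 2.1 (used when !isPtx)
    bool ptxCompiles;  // stand-in for "the JIT compiler accepted this PTX"
};

// Toy rule: a candidate is usable if it is machine code for exactly this
// device, or PTX that compiles. If nothing is usable, report NoBinaryForGpu.
LoadStatus PickCandidate(const std::vector<Candidate>& candidates, int deviceArch)
{
    assert(deviceArch > 0);
    for (const Candidate& c : candidates) {
        const bool usable = c.isPtx ? c.ptxCompiles : (c.arch == deviceArch);
        if (usable) return LoadStatus::Success;
    }
    return LoadStatus::NoBinaryForGpu;
}
```

Under this model, a module holding a single PTX image that fails to compile leaves zero usable candidates, which is how a plain syntax error could surface as "no binary for GPU".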

Billy_Rubin · November 15, 2012, 10:49pm

Hurrah!! Hurrah!! Gentlemen,

!! THIS PROBLEM IS FINALLY SOLVED !!

All this time, I was thinking that the CUDA_ERROR_NO_BINARY_FOR_GPU error was a hardware-related problem, because the docs specifically state:

“This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration.”

which, as it turns out, is bizzaranese for “Your kernel has one or more compiler errors”…

This I’ve confirmed by taking the sage advice of one “njuffa” (who, not incidentally, has helped me once before with a better understanding of an “oversubscription factor”), and checked the contents of my LoadOptVals[ OptNdx_ERR_BUFF ] buffer after getting CUDA_ERROR_NO_BINARY_FOR_GPU, and

!! Lo and behold !! (excuse my excitement)

I’m now getting meaningful PTX compiler errors!! Yay!

Thank you, thank you one and all…

…and now we return to our regularly scheduled programming…

njuffa · November 16, 2012, 12:07am

Glad to hear it all worked out. I can’t take any credit here, I was merely forwarding advice from a person who has hands-on experience with this particular API.
