I have a small problem while rewriting an application to use the driver API. The main reason for this conversion is the use of cuCtxContext(). I’m writing a multithreaded tool, so I have to control when a thread has access to a given CUDA context, and contexts have to migrate from one thread to another. That part works, but now I get error 0x2bd (701, CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES) when I launch the kernel.

I did not change the parameters: same grid and block size (and therefore the same shared memory size, etc.), but as I said, CUDA returns the error.

I’m posting my kernel call; maybe somebody has an idea what the problem is:

dim3 dimGrid(dimGX, dimGY, 1);
dim3 dimBlock(dimBX, dimBY, 1);
kernel1<<<dimGrid, dimBlock>>>(in, pitch, integer1, integer2, conf);

where the kernel is declared as:

__global__ void kernel1(unsigned int * in, size_t pitch, int integer1, int integer2, refinementConfiguration conf)

and refinementConfiguration is a typedef for the following struct:

struct refinementConfig {
  int int1;
  int int2;
  float fl1;
  int int3;
  int int4;
  int int5;
};

typedef struct refinementConfig refinementConfiguration;

The kernel parameters and refinementConfig stay the same, but I’m now launching the kernel through the driver API:

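int offset = 0;
cuParamSetv(cuFunction, offset, &in, sizeof(in)); // pass the address of the pointer variable, not the pointer value
offset += sizeof(in);
cuFuncSetSharedSize(cuFunction, 0); // no extern __shared__ array; all shared memory is statically declared in the kernel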

With --ptxas-options=-v, I get the following output for the stand-alone program, which runs fine:

Used 32 registers, 60+36 bytes lmem, 15880+12808 bytes smem, 72 bytes cmem[1]

By the way, I have an NVIDIA GTX280 with 1GB of RAM.

Does anyone have an idea what the problem is? Is there something I’ve forgotten? Could my own struct be a problem for the driver API? Keep in mind that the whole program runs with the runtime API, both stand-alone and integrated into another program.

Looks like you ran out of shared memory: sm_13 has 16K of shared memory, while ptxas shows that your kernel exceeds this limit…

Also watch register usage: the GTX280 has 16K registers per SM, so if the number of threads per block times the number of registers per thread exceeds 16K, your kernel will fail to launch.

Thanks for your reply.

That’s exactly what puzzles me: this is the ptxas output of the stand-alone program, which has NO problem. I have only integrated this kernel into another program, and with the runtime API everything works fine. BUT if I switch from runtime API calls to driver API calls, the kernel fails…

I also reduced the shared memory size by decreasing the block size, but the error stays the same…

Are there other causes for this error? Could it be caused by an earlier call? I don’t think so, because ALL calls are wrapped in CU_SAFE_CALL(), but is it possible?

Best regards