Odd behavior from various Init() calls

So, I have the following code:

int sum = 0;

//Create Arrays for CPU
float *cpuA;												//freed
float *cpuB;												//freed
float *cpuC;												//freed

//Create Arrays for the GPU
float *gpuA;												//freed
float *gpuB;												//freed
float *gpuC;												//freed

//Create Vectors for various functions
float *vector;												//freed
float *vector2;												//freed
float *meanVectorGPU;										//freed
float *meanVectorCPU;										//freed
//Create i, j for various loops
int i, j;

//Declare sizes for the arrays
int nRows = 8100;
int nColumns = 8100;

//Used for the call to the kernel, so that the thread count does not exceed 512
dim3 threads2(nColumns);
dim3 grid2(nColumns);
dim3 threads(nRows,nColumns);
dim3 grid(nRows,nColumns);
const dim3 dimBlock(1);
float divisor = ceil((float)nRows*(float)nColumns/256.0f)+1;
int dim = ceil(sqrt((float)(nColumns*nRows)/divisor));
const dim3 dimGrid(dim, dim);
//Create the items for the timer
unsigned int timer = 0;
unsigned int elapsed = 0;

//Initialize cutil

The problem I am having is that my code works wonderfully with CUT_DEVICE_INIT() commented out (I have more code below this, but this is where the error is). However, if I call CUT_DEVICE_INIT(), then nColumns and nRows can each be at most 1581. The same thing happens if I use cublasInit(). I am wondering whether this is because each init is slow to complete, so that certain portions of the code try to run before initialization has finished.
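For reference, the divisor/dim arithmetic from the snippet above can be reproduced on the host without a GPU. This is only a sketch: `LaunchShape` and `launchShapeFor` are hypothetical names that do not appear in the original code, and the float math is copied from the post as-is.

```cpp
#include <cmath>

// Hypothetical helper: repeats the divisor/dim arithmetic from the post
// for an nRows x nColumns matrix, using float math as the original does.
struct LaunchShape {
    float divisor;  // value of `divisor` in the post
    int dim;        // value of `dim`, used there as dimGrid(dim, dim)
};

LaunchShape launchShapeFor(int nRows, int nColumns) {
    LaunchShape s;
    s.divisor = std::ceil((float)nRows * (float)nColumns / 256.0f) + 1;
    s.dim = (int)std::ceil(std::sqrt((float)(nColumns * nRows) / s.divisor));
    return s;
}
```

With these formulas, launchShapeFor(1581, 1581) and launchShapeFor(8100, 8100) both give dim = 16, so dimGrid is 16 x 16 at the working size and at the failing size. What changes is divisor (9765 versus 256291), which presumably is the amount of work each block has to get through.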

The code you left out matters, since all of your variable definitions here are unused. And what is “the error”? A compile error? Runtime crash? Simply incorrect results?

I suspect that you’re surpassing some of the device limits, especially in thread dimensions per block, but it’s impossible to tell without seeing the invocation code.

I’m guessing you’re going to be calling a kernel with a thread dimension like 8100x8100, which is Way Past the limits. Look at A.1.1 in the programming guide: you can only have 512 threads per block.
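As a rough illustration of the limits being discussed, the block sizes declared in the first post can be checked on the host. This sketch assumes the compute capability 1.x limits (512 threads per block, block dimensions up to 512 x 512 x 64). `Dim3` and `blockFits` are stand-in names made up for this example so that it builds without the CUDA headers; the real limits for a given card can be read with cudaGetDeviceProperties().

```cpp
// Stand-in for CUDA's dim3 so the check builds without the CUDA toolkit.
struct Dim3 {
    unsigned int x, y, z;
    Dim3(unsigned int x_ = 1, unsigned int y_ = 1, unsigned int z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Assumed limits for compute capability 1.x devices.
const unsigned int kMaxThreadsPerBlock = 512;
const unsigned int kMaxBlockX = 512, kMaxBlockY = 512, kMaxBlockZ = 64;

// Returns true if `block` could be used as the threads-per-block argument.
bool blockFits(const Dim3& block) {
    if (block.x == 0 || block.y == 0 || block.z == 0)
        return false;
    if (block.x > kMaxBlockX || block.y > kMaxBlockY || block.z > kMaxBlockZ)
        return false;
    // 64-bit product so large dimensions cannot overflow.
    unsigned long long total = (unsigned long long)block.x * block.y * block.z;
    return total <= kMaxThreadsPerBlock;
}
```

With nRows = nColumns = 8100, blockFits(Dim3(8100, 8100)) and blockFits(Dim3(8100)) are both false, so `threads` and `threads2` from the post could not be used as block sizes. blockFits(Dim3(1)), which matches `dimBlock`, is true.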

No, I’m not calling a kernel like this. I am calling a kernel using the dimensions that I specified in the code, which are well within the specified limits. I am getting a kernel error. The .exe quits before I can fully read the text, but I’m pretty sure it’s the standard kernel error from

CUT_CHECK_ERROR("Kernel execution failed");

If I were calling my kernel incorrectly, it would make no sense that it works without an init() for matrices as big as will fit in memory but fails with the init(). Is this a thread sync problem? I.e., if I call an init(), do I have to explicitly tell it to wait for initialization to finish before calling my kernel? I would assume that this is forced ‘behind the scenes’, but I’m not positive.