"Invalid device function" runtime error occurs only when running on the actual device


I’m relatively new to CUDA and have been trying a number of different small projects.
Right now, I’m writing a multi-precision multiplication CUDA program.
I have a rudimentary program working under the emulator (I get correct results with no errors)
but when I run the kernel on the actual device, I get the error “Invalid device function” as the result of the kernel execution.
Furthermore, I get the following errors in the output window of Visual Studio:

NVAPI: bLhThunkInit: failed assert: lhThunk.pfnGetDisplayConfigBufferSizes
NVAPI: bLhThunkInit: failed assert: lhThunk.pfnQueryDisplayConfig
NVAPI: bLhThunkInit: failed assert: lhThunk.pfnSetDisplayConfig
NVAPI: bLhThunkInit: failed assert: lhThunk.pfnDisplayConfigGetDeviceInfo
NVAPI: bLhThunkInit: failed assert: lhThunk.pfnDisplayConfigSetDeviceInfo

Why does this happen?
What do these errors mean and how can I fix them?

I am running the program on a custom-built Core i7 920 system with Windows Vista Business SP1 and an NVIDIA GTX 285.
The CUDA SDK version is 2.3. The driver version is 190.38.
I think the basic configuration is correct, as I have been able to run various sample projects from NVIDIA without problems.

I would highly appreciate any helpful advice or reference to good information sources regarding the meaning of the error messages.

---------code excerpt------------

struct LongInteger12 {
	unsigned int numLimbs;
	unsigned int numAllocLimbs;
	unsigned int * data;
};

__global__ void Multiply (struct LongInteger12 * a, struct LongInteger12 * b, struct LongInteger12 * product) {
	// Multiplies two long integers

	// Assumes same lengths for all inputs
	unsigned int j, carry = 0, sum, temp;
	__shared__ unsigned int a_sh [16];
	__shared__ unsigned int b_sh [16];
	__shared__ unsigned int p_sh [32];
	__shared__ unsigned int c_sh [32];

	//if (threadIdx.x == 0) {
	//	atomicAdd (&(product->numLimbs), a->numLimbs + b->numLimbs);
	//}
	if (threadIdx.x == 1) {
		product->numLimbs = a->numLimbs + b->numLimbs;
	}
	// load data to shared memory
	a_sh [threadIdx.x] = a->data[threadIdx.x];
	b_sh [threadIdx.x] = b->data[threadIdx.x];
	p_sh [threadIdx.x] = 0;
	p_sh [threadIdx.x + b->numLimbs] = 0;
	c_sh [threadIdx.x] = 0;
	__syncthreads ();
	// do preliminary multiply
	for (j=0; j<a->numLimbs; j++) {
		temp = carry + __umul24 (a_sh[j], b_sh[threadIdx.x]);
		__syncthreads ();
		sum = p_sh [j + threadIdx.x] + temp;
		carry = (sum >> 12) & 0xFFFFF;
		p_sh [j + threadIdx.x] = sum & 0xFFF;
	}
	// Carry propagation for multiply stage
	p_sh [threadIdx.x + b->numLimbs] += carry;

	__syncthreads ();
	// Output data to product array
	product->data [threadIdx.x] = p_sh [threadIdx.x];
	product->data [threadIdx.x + b->numLimbs] = p_sh [threadIdx.x + b->numLimbs];
}
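
For checking the kernel's output against a known-good result, a host-side reference of the same schoolbook multiplication can be sketched as below. This is not part of the original program: the function name multiplyLimbs12 is hypothetical, and the sketch assumes 12-bit limbs stored least-significant first (inferred from the `>> 12` and `& 0xFFF` operations in the kernel). It is plain host C++, so it also compiles unchanged inside a .cu file.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): schoolbook multiplication
// of two numbers stored as base-2^12 limbs, least-significant limb first.
// Assumes every input limb is already below 2^12.
std::vector<unsigned int> multiplyLimbs12 (const std::vector<unsigned int> & a,
                                           const std::vector<unsigned int> & b) {
	std::vector<unsigned int> p (a.size () + b.size (), 0);
	for (std::size_t i = 0; i < b.size (); ++i) {
		unsigned int carry = 0;
		for (std::size_t j = 0; j < a.size (); ++j) {
			// limb + 12x12-bit product + carry stays below 2^24,
			// so a 32-bit unsigned int cannot overflow here.
			unsigned int sum = p [i + j] + a [j] * b [i] + carry;
			p [i + j] = sum & 0xFFF;	// keep the low 12 bits in this limb
			carry = sum >> 12;		// pass the rest on to the next limb
		}
		// This limb has not been written by an earlier row, so assign the carry.
		p [i + a.size ()] = carry;
	}
	return p;
}
```

Because the carry is resolved sequentially inside each row, this reference always produces fully normalized limbs, which makes it usable as a baseline for the rare inputs where the parallel version is expected to differ.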



(Note: I understand that this code is not 100% correct. It generates correct answers for most numbers but should output inaccurate results in rare cases. This should not affect its ability to run on the device, though. I also realize that the code only works on a single pair of integers and does not take advantage of multiple thread blocks at this point. I will add support for all of this once I get the basics working. Currently, I just invoke the kernel with a 1x1 grid and a numLimbs x 1 thread block.)
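
The launch configuration described in the note, together with basic error checking, can be sketched roughly as follows. This is an illustration under stated assumptions, not the original host code: the Placeholder kernel and the ok helper are hypothetical stand-ins, and numLimbs is assumed to be 16 to match the shared arrays in the excerpt. It also prints the compute capability of device 0, which can be compared with the architecture the project is compiled for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the Multiply kernel so that the sketch is self-contained.
__global__ void Placeholder (unsigned int * out) {
	out [threadIdx.x] = threadIdx.x;
}

// Hypothetical helper: prints the CUDA error string and reports whether the call succeeded.
static bool ok (cudaError_t err, const char * what) {
	if (err != cudaSuccess) {
		printf ("%s failed: %s\n", what, cudaGetErrorString (err));
		return false;
	}
	return true;
}

int main () {
	const unsigned int numLimbs = 16;	// assumed; matches the 16-entry shared arrays
	unsigned int h_out [16];
	unsigned int * d_out = NULL;

	// Print the compute capability of device 0.
	cudaDeviceProp prop;
	if (!ok (cudaGetDeviceProperties (&prop, 0), "cudaGetDeviceProperties")) return 1;
	printf ("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);

	if (!ok (cudaMalloc ((void **) &d_out, numLimbs * sizeof (unsigned int)), "cudaMalloc")) return 1;

	dim3 grid (1, 1);		// 1x1 grid
	dim3 block (numLimbs, 1);	// numLimbs x 1 thread block
	Placeholder <<<grid, block>>> (d_out);

	// Errors detected at launch time are returned by cudaGetLastError.
	if (!ok (cudaGetLastError (), "kernel launch")) return 1;

	// The blocking copy waits for the kernel to finish, so errors raised
	// while the kernel runs are returned here.
	if (!ok (cudaMemcpy (h_out, d_out, sizeof (h_out), cudaMemcpyDeviceToHost), "cudaMemcpy")) return 1;

	cudaFree (d_out);
	return 0;
}
```

Checking cudaGetLastError directly after the launch separates failures to start the kernel from failures that occur while it runs, which only show up at the next blocking call.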