Emulator works but G80 doesn't

So after much wrangling, I got my code working with the emulator, but when I re-compile it to run on the G80, I get nothing but zeros out. I've only got one kernel, and it does a texture lookup, a distance calculation, a couple of sin/cos lookups, and then some multiplications and additions. Absolutely nothing strange or special mathematically; it doesn't even loop.

Everything else, including grid and block size, is the same between the emulator and G80 versions, though I do have some print statements wrapped in #ifdef DEVICE_EMULATION directives. What the heck is going on?

Are you checking for errors after the kernel call? With my code, I've found that running with too many threads per block results in the kernel returning immediately with an "unspecified launch failure" because there aren't enough registers.

Is there a special way I have to check for that, or is it something that would be printed to stdout? I'm certainly not seeing any error messages like that; it acts like it's working fine, but then it returns just zeros as a result. Right now I shouldn't have more than 256 threads per block.

If you are using the cutil code like the example projects, then you can just insert the macro invocation CUT_CHECK_ERROR("useful message"). The example projects also wrap every cuda* call with CUDA_SAFE_CALL(). These macros are defined to be nothing in release mode (Visual Studio), but will work in debug mode.

If you aren't using the cutil code like the example projects, then you should still be checking every cuda* call for an error return value. Also look up the functions cudaGetLastError() and cudaGetErrorString() in the programming guide.
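For example, a bare-bones version of that checking looks something like this (the kernel name, buffer, and launch configuration are just placeholders; cudaThreadSynchronize() is the runtime call that waits for the kernel to finish):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *buf, int n)
{
    // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= 2.0f;
}

void launchAndCheck(float *d_buf, int n)
{
    cudaError_t err;

    // Kernel launches are asynchronous and don't return an error code,
    // so query the error state right after the launch...
    myKernel<<< 32, 256 >>>(d_buf, n);
    err = cudaGetLastError();               // catches bad launch configurations
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // ...and again after synchronizing, which catches errors that only
    // show up while the kernel is actually running on the device.
    cudaThreadSynchronize();
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
}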

Great, that's just what I was looking for; I don't know how I missed them in the programming guide. Though when I put them in after my kernel call, I get these errors:

and then the "unspecified driver error" just keeps repeating. All the SDK examples seem to work just fine, so I don't really know what I'm doing wrong. Here's how I'm calling my kernel:

// Copy inputs to device
cudaMemcpy((void**)d_roverPos, roverPos, 3*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy((void**)d_m01, m01, 2*sizeof(float), cudaMemcpyHostToDevice);

d_cudaCalcFrame <<< *grid, *threads >>> (d_roverPos, d_m01, lambda, readBuff, writeBuff, width);

// Print any errors that occurred
printf("%s\n", cudaGetErrorString(cudaGetLastError()));

Weird, I went back into my kernel and changed it from having a read buffer and a write buffer to having a single buffer, and now it works, *shrug*. It is giving me totally different results than the emulation, though.

Use the occupancy calculator to help you determine how many threads can run given the # of registers your kernel requires. It will also help you optimize when things are working.

http://developer.download.nvidia.com/compu…_calculator.xls

To find out register usage, compile your .cu files with "-cubin" instead of "-c" and you'll be able to read out some statistics.
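For example (the exact mangled name and the numbers here are only illustrative), compiling with something like nvcc -cubin mykernel.cu produces a plain-text .cubin, and each kernel shows up as a block roughly like:

code {
	name = _Z15d_cudaCalcFrame...
	lmem = 0
	smem = 44
	reg  = 10
}

lmem is local memory per thread in bytes, smem is shared memory per block, and reg is registers per thread; reg is the number to plug into the occupancy calculator.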

Perhaps changing to a single buffer changed the number of registers used and that is now why it runs. Check the register usage as bbarran suggested. And it would also be helpful to know how many threads per block you are running.

As to getting different results on the card vs. emu, I've seen this a few times in my development. Are you using any shared memory? When it has happened to me, it has been because I didn't have the appropriate __syncthreads() calls around updates to shared memory. It tends to still work OK in emu mode because the threads are run in a very sequential manner. The device interleaves threads so much more that any race conditions on shared memory are much more likely to show up. With a single in/out global array, you need to watch for race conditions in accessing that memory too.
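As a contrived sketch (assuming one block of 256 threads, nothing like your actual kernel), this is the shape of the bug: each thread reads a neighbour's shared element, so it only works if there is a barrier between the write and the read:

__global__ void shiftLeft(float *out, const float *in)
{
    __shared__ float buf[256];
    int i = threadIdx.x;

    buf[i] = in[i];
    __syncthreads();   // without this barrier, thread i can read buf[i+1]
                       // before its neighbour has written it; that often
                       // still "works" in emu mode but gives garbage on
                       // the device
    out[i] = (i < 255) ? buf[i + 1] : buf[0];
}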

Isn't this the same error you made here?

Peter

Those cuda calls actually work fine with that syntax; I copied it right out of the programming guide.

I have a similar problem.

My code works well in emulation mode, but the calculation on the device returns a wrong result.

There is no error after kernel execution.

Does anyone have a suggestion for how I can avoid the calculation error?

Here is the kernel code:

__constant__ unsigned char key[32];
__shared__ unsigned char shared[32];

__global__ void kernel_mul(unsigned char *r, const unsigned char *q, const int len_q, const int len_p) {
	const unsigned int tid = threadIdx.x + threadIdx.y * blockDim.x;
	register unsigned int tmp, c, n, l;
	tmp = c = n = l = 0;
	register unsigned int base_index, k;
	base_index = k = __mul24(blockDim.x, tid) + len_p;

	for (int i = len_p - 1; i >= 0; i--) {
		tmp = __mul24(q[tid], (int)key[i]) + c;
		shared[k] = tmp & 255;
		c = tmp >> 8 & 255;
		k--;
	}
	__syncthreads();

	shared[k] = c;
	c = 0;
	k = len_p + tid;

	for (int i = base_index; i > base_index - len_p - 1; i--) {
		l = (shared[i] + shared[k]) & 255;
		shared[k] = l;
		c = (l < shared[i]);
		n = len_p + tid - 1;
		while (c > 0) {
			l = (shared[n] + c) & 255;
			c = (l < shared[n]);
			shared[n] = l;
			n--;
		}
		k--;
	}
	__syncthreads();

	r[tid] = shared[tid];
}

And here is the part of the function with the kernel call:

unsigned char *p = (unsigned char*)malloc(l_p*sizeof(unsigned char));
unsigned char *q = (unsigned char*)malloc(l_q*sizeof(unsigned char));

// init p and q

cudaMemcpyToSymbol(key, p, sizeof(unsigned char) * size, 0);

unsigned char *id_q;
cudaMalloc((void**)&id_q, l_q*sizeof(unsigned char));
cudaMemcpy(id_q, q, l_q*sizeof(unsigned char), cudaMemcpyHostToDevice);

unsigned char *od_r;
cudaMalloc((void**)&od_r, (l_p + l_q) * sizeof(unsigned char));

dim3 dimBlock(l_p + l_q);
dim3 dimGrid(1);
kernel_mul<<< dimGrid, dimBlock >>>(od_r, id_q, l_q, l_p);
CUT_CHECK_ERROR("Kernel execution failed");

unsigned char *tmp_r = (unsigned char*)malloc((l_p + l_q) * sizeof(unsigned char));
cudaMemcpy(tmp_r, od_r, (l_p + l_q) * sizeof(unsigned char), cudaMemcpyDeviceToHost);

Hi there,

I experienced similar errors, but was able to resolve all of them one way or another. The first step (if you are using the CUDA SDK) is to compile in DEBUG mode. All those nice macros like CUDA_SAFE_CALL or CUT_CHECK_ERROR expand to nothing in release mode. So when you encounter a zero-filled buffer after your kernel execution, it's most likely a failed kernel launch, and in debug mode you'll be able to read this (either a 'launch failure' or an 'unknown device error' message).

A second typical mistake is about synchronizing. Whenever you synchronize your threads with __syncthreads(), that part of your code MUST BE REACHED BY ALL THREADS. So never put a sync inside an if-branch or a loop that does not run exactly the same way for every thread!
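In other words (a toy sketch, not related to any code in this thread; it assumes a single block of at most 256 threads and n <= blockDim.x), keep the conditional work but make the barrier itself unconditional:

__global__ void gatherFirstN(float *out, const float *in, int n)
{
    __shared__ float s[256];
    int tid = threadIdx.x;

    // WRONG: if the __syncthreads() sits inside the branch, threads with
    // tid >= n never reach it:
    //   if (tid < n) { s[tid] = in[tid]; __syncthreads(); }

    // RIGHT: every thread reaches the same barrier, even the ones that
    // did no work before it.
    if (tid < n)
        s[tid] = in[tid];
    __syncthreads();

    if (tid < n)
        out[tid] = s[n - 1 - tid];   // some arbitrary use of the shared data
}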

The third mistake is also related to sync, in combination with shared memory. Make sure that whenever you write to shared memory, you do it in a controlled and synchronized way. Reading can be done in any pattern you like: even bank conflicts on reads only cost a couple of clock cycles, nothing worse. But be careful when writing!

And one last mistake that does not cause the kernel to fail or crash, but will produce different results in EMU and DEVICE modes: reading or writing beyond your shared memory buffers!

I hope that NVIDIA is going to offer CUDA programmers a way to signal a kernel error (flags, whatever), something like setKernelError(errorCode) that can be queried after kernel execution. That would help a LOT in debugging errors in DEVICE mode.

@alik:
I noticed a couple of things in your code. Some will improve your performance (a lot!) and others might resolve your trouble with the wrong results:

1- Your kernel launch will use only one of the multiprocessors on the graphics card. Note the following: every block is executed on ONE multiprocessor, because it may use shared memory, which is only shared between the threads on a single multiprocessor. If you have more blocks, they are distributed over all available multiprocessors. So changing your block and grid sizes so that you have at least 8 (better: 16 or more) blocks in your grid might speed up your kernel by a factor of 8 (or more).
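For example (a generic sketch, not your multiplication kernel; d_data, n, and the scaling factor are just placeholders), the usual pattern is to launch many small blocks and rebuild a global index from blockIdx:

__global__ void scale(float *data, float factor, int n)
{
    // global element index across all blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// host side: 64 threads per block and enough blocks to cover n elements,
// so the work can spread over all multiprocessors
dim3 block(64);
dim3 grid((n + block.x - 1) / block.x);
scale<<< grid, block >>>(d_data, 2.0f, n);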

2- Shared memory access: you are using unsigned char (1 byte) in your shared memory. The shared memory consists of banks with 4-byte elements. In your code you are most likely accessing every bank element with four threads at the same time, causing write bank conflicts. The same happens later with the read accesses. Even if this is not causing any harm, it makes your shared memory access four times slower. Try to use a stride of 4*threadIdx.x to access the shared memory, so that every thread accesses an individual bank. If you execute your kernel with threads(32, 4), you could access SM[threadIdx.x * 4 + threadIdx.y], which should not result in any bank conflicts, because different threadIdx.y values are in different warps that are executed at different times.
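To make that indexing concrete, here is a stripped-down sketch (assuming a 32x4 block and a shared byte array padded out to 128 elements; in and out are just placeholders):

__global__ void bankFriendlyCopy(unsigned char *out, const unsigned char *in)
{
    // 128 bytes = 32 four-byte words: threadIdx.x picks the word and
    // threadIdx.y the byte inside it, so the 16 threads of a half-warp
    // all land in different banks.
    __shared__ unsigned char sm[128];

    int idx = threadIdx.x * 4 + threadIdx.y;
    sm[idx] = in[idx];
    __syncthreads();
    out[idx] = sm[idx];
}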

3- In your second loop, you read and write shared memory without synchronizing after the writes. Are you sure that the data is ALWAYS consistent? Or could it happen that some thread reads data which has not yet been properly updated? This might not show up in the EMU version because threads are executed differently from the way they run in DEVICE mode. To me, this part looks very suspicious.

4- You did not specify what values blockDim.x and len_p have. But from the line where you compute
k = __mul24(blockDim.x, tid) + len_p;
it might be that you read or write beyond the 32 declared elements of your shared memory. This would be my number one guess as to why your code fails…
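Concretely, with made-up numbers (you did not post the actual ones): if blockDim.x were 32 and len_p were 16, then already for tid = 1 you would get k = __mul24(32, 1) + 16 = 48, so the very first shared[k] store in the loop lands well past the 32 bytes declared for shared[]. Emulation mode will not trap that, which would also explain why the EMU and DEVICE results differ.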

Hope this helps,
Jake