I experienced similar errors, but was able to resolve all of them one way or another. So the
first step (if you are using the CUDA SDK) is to compile in DEBUG mode. All those nice
macros like CUDA_SAFE_CALL or CUT_CHECK_ERROR expand to nothing in release
mode. So when you encounter a zero-filled buffer after your kernel execution, it's most
likely a failed kernel launch, and in debug mode you'll be able to read about it (either a
'launch failure' or an 'unknown device error' message).
A second typical mistake concerns synchronization. Whenever you synchronize your
threads with __syncthreads(), that part of your code MUST BE REACHED BY ALL THREADS.
So never put a sync inside an if-branch or a loop that does not run exactly the same way
for all threads!
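A minimal sketch of the right and wrong placement (the reduction itself is just an example, not the poster's code):

```cuda
__global__ void reduceKernel(float *data)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;

    buf[tid] = data[tid];
    __syncthreads();

    // WRONG: only threads with tid < 128 would reach the barrier -> hang
    // if (tid < 128) { buf[tid] += buf[tid + 128]; __syncthreads(); }

    // RIGHT: the barrier sits outside the if, so every thread hits it
    for (int stride = 128; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();   // reached by all 256 threads, every iteration
    }

    if (tid == 0)
        data[0] = buf[0];
}
```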
The third mistake is also related to sync, in combination with shared memory. Make sure
that whenever you write to shared memory, you do it in a controlled and synchronized way.
Reading may be done in any order: even bank conflicts on reads only cost a couple of
clock cycles, nothing worse. But be careful when writing!
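One way to keep writes controlled, as a sketch (assuming a 64-thread block; the scattered read index is just for illustration):

```cuda
__global__ void tableKernel(const int *in, int *out)
{
    __shared__ int table[64];
    int tid = threadIdx.x;

    // Controlled write: each thread writes exactly one element it owns
    table[tid] = in[tid];
    __syncthreads();           // all writes visible before anyone reads

    // Reads may be scattered; at worst a read bank conflict costs a
    // few extra cycles, never wrong data
    out[tid] = table[(tid * 7) % 64];
}
```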
And one last mistake which does not cause the kernel to fail or crash, but will produce
different results in EMU and DEVICE modes: reading or writing beyond your shared memory buffers!
I hope that NVidia is going to offer Cuda programmers a way to indicate (signal, flags, whatever)
a kernel error. Something like setKernelError( errorCode ), which can be queried after kernel
execution. That would help a LOT in debugging errors in DEVICE mode.
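Until something like that exists, the closest you can get is querying the runtime's own launch status after the kernel, a sketch (it only reports launch-level errors, not an application-defined error code):

```cuda
myKernel<<< grid, block >>>( d_data );

// Errors are reported asynchronously, so synchronize first, then ask
// the runtime whether the launch itself failed.
cudaThreadSynchronize();
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
```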
I noticed a couple of things in your code. Some will improve your performance (a lot!) and others
might resolve your troubles with having wrong results:
1- Your kernel execution uses only one of the eight multiprocessors on the graphics card. Note
the following: every block is executed on ONE multiprocessor, because it might use
shared memory, which is only shared between threads on the same multiprocessor. If you have
more blocks, they are distributed over all available multiprocessors. So changing your
block and grid dimensions so that you have at least 8 (better: 16 or more) blocks in your
grid might speed up your kernel by a factor of 8 (or more).
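As a sketch of what that change looks like at the launch site (the kernel name and data pointer are placeholders):

```cuda
// Before: one block keeps 7 of 8 multiprocessors idle
// kernel<<< 1, 128 >>>( d_data );

// After: same total thread count, spread over 16 blocks so the
// hardware can fill all multiprocessors (each thread recomputes its
// global index from blockIdx)
dim3 block( 8 );
dim3 grid( 16 );
kernel<<< grid, block >>>( d_data );

// inside the kernel:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
```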
2- Shared memory access: you are using unsigned char (1 byte) in your shared memory. The
shared memory consists of banks with 4-byte elements. In your code you are most likely accessing
every bank element with four threads at the same time, causing write bank conflicts. The same
happens later with read access. Even if this is not causing any harm, it makes your shared
memory access four times slower. Try to use a stride of 4*threadIdx.x to access the shared memory
so that every thread accesses an individual bank. If you execute your kernel with threads( 32, 4 ),
you could access SM[ threadIdx.x * 4 + threadIdx.y ], which should not cause any bank conflicts,
because different threadIdx.y values are in different warps, being executed at different times.
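A small sketch of that indexing scheme (the buffer name and input are placeholders, not the poster's code):

```cuda
__global__ void strideKernel(const unsigned char *in)
{
    __shared__ unsigned char SM[32 * 4];

    // threads( 32, 4 ): threadIdx.x = 0..31, threadIdx.y = 0..3.
    // Warp 0 is the 32 threads with threadIdx.y == 0; with a stride of
    // 4 they touch bytes 0, 4, 8, ..., 124 -> one distinct 4-byte bank
    // per thread, so no bank conflict within the warp. The other
    // threadIdx.y values belong to other warps anyway.
    int idx = threadIdx.x * 4 + threadIdx.y;
    SM[idx] = in[idx];
}
```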
3- In your second loop, you read and write to shared memory without synchronizing after the
writes. Are you sure the data is ALWAYS consistent? Or could it happen that a thread reads
data which has not yet been properly modified? This might not show up in the EMU version, because
threads are executed differently from the way they run in DEVICE mode. To me, this part looks suspicious.
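The general pattern for such a loop, sketched with hypothetical helpers neighbour() and f() standing in for whatever the loop actually computes:

```cuda
// RISKY: thread tid may read sm[j] before the owner of j has written
// its new value for this iteration
// for (int step = 0; step < nSteps; ++step)
//     sm[tid] = f( sm[ neighbour(tid, step) ] );

// SAFER: separate the read and write phases with barriers
for (int step = 0; step < nSteps; ++step) {
    float v = f( sm[ neighbour(tid, step) ] );  // everyone reads old data
    __syncthreads();
    sm[tid] = v;                                // then everyone writes
    __syncthreads();          // writes visible before the next read
}
```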
4- You did not specify what values blockDim.x and len_p have. But from the line where you compute
k = __mul( blockDim.x, tid ) + len_p;
it might be that you read beyond the 32 values of your shared memory buffer. This would be my
number one guess as to why your code fails…
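A cheap way to test that guess while debugging, assuming the buffer really has 32 elements (sm, out and the surrounding names are placeholders for whatever your code uses):

```cuda
__shared__ float sm[32];

int k = __mul24( blockDim.x, tid ) + len_p;  // note: __mul24 is the
                                             // runtime's 24-bit multiply

// In EMU mode an out-of-range sm[k] silently reads host memory; on the
// device it reads garbage or corrupts other shared data. Guard it:
if (k < 32)
    out[tid] = sm[k];
// else: k is out of range -> that confirms the out-of-bounds access
```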
Hope this helps,